Thread: Why does copy_relation_data only use WAL when WAL archiving is enabled?
If I run the database in non-archiving mode and execute the following command: alter table t set tablespace tblspc1; Isn't it possible that the "new t" can't be recovered?
Jacky Leng wrote: > If I run the database in non-archiving mode and execute the following > command: > alter table t set tablespace tblspc1; > Isn't it possible that the "new t" can't be recovered? No. At the end of copy_relation_data we call smgrimmedsync, which fsyncs the new relation file. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> Jacky Leng wrote:
>> If I run the database in non-archiving mode, and execute the following
>> command:
>> alter table t set tablespace tblspc1;
>> Isn't it possible that the "new t" can't be recovered?
>
> No. At the end of copy_relation_data we call smgrimmedsync, which fsyncs
> the new relation file.

Usually that's true, but how about this situation:

* First, do the following series:
  * Create two tablespaces, SPC1 and SPC2;
  * Create table T1 in SPC1 and insert some values into it; suppose T1's oid/relfilenode is OID1;
  * Drop table T1; ---------- OID1 is released in pg_class and can be reused.
  * Do anything that will make the next oid allocated from pg_class be OID1, e.g. insert many, many tuples into a relation with oids;
  * Create table T2 in SPC2 and insert some values into it; its oid/relfilenode is OID1;
  * Alter table T2 set tablespace SPC1; --------- T2 goes to SPC1 and uses the same file name the old T1 had;
* Second, suppose that no checkpoint has occurred during the above series -- although not very likely;
* Kill the database abnormally;
* Restart the database.

Let's analyze what will happen during the recovery process:

* When T1 is re-created, it finds that its file is already there -- actually this file is now T2's;
* "T1"'s file (actually T2's) is re-dropped;
* ....
* T2 is re-created, and finds that its file has disappeared, so it re-creates one;
* As copy_relation_data didn't record any xlog about T2's ALTER TABLE ... SET TABLESPACE, after recovery we'll find that T2 is empty!!!

> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
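Jacky's failure sequence can be mimicked with a toy replay model (plain Python, not PostgreSQL code; the record format and names are invented for illustration). Files are a dict keyed by relfilenode; the bulk copy done by copy_relation_data leaves no WAL record, which is exactly the hole:

```python
# Toy model of WAL replay with a reused relfilenode (illustration only).
def replay(wal, files):
    """Replay WAL records against the simulated filesystem."""
    for op, node, payload in wal:
        if op == "create":
            files.setdefault(node, [])
        elif op == "insert":
            files.setdefault(node, []).append(payload)
        elif op == "drop":
            files.pop(node, None)   # replaying DROP removes the file
        # note: there is no record type for the non-WAL-logged bulk copy

files = {}
wal = []
# CREATE T1 (relfilenode 100) and insert a row -- both WAL-logged.
wal += [("create", 100, None), ("insert", 100, "t1-row")]
# DROP T1: relfilenode 100 becomes reusable immediately.
wal += [("drop", 100, None)]
# T2 now gets relfilenode 100; its contents were copied WITHOUT WAL
# (only fsync'd), so recovery knows nothing about the copy.
wal += [("create", 100, None)]

replay(wal, files)      # crash before the next checkpoint -> full replay
print(files[100])       # -> [] : T2's file exists again, but it is empty
```

The replayed DROP of T1 wipes out T2's already-fsynced contents, and nothing in the log puts them back.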
On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote: > Second, suppose that no checkpoint has occurred during the above > series -- although not very likely; That part is irrelevant. It's forced out to disk and doesn't need recovery, with or without the checkpoint. There's no hole that I can see. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote: >> Second, suppose that no checkpoint has occurred during the above >> series -- although not very likely; > > That part is irrelevant. It's forced out to disk and doesn't need > recovery, with or without the checkpoint. > > There's no hole that I can see. No, Jacky is right. The same problem exists at least with CLUSTER, and I think there are other commands that rely on immediate fsync as well. Attached is a shell script that demonstrates the problem on CVS HEAD with CLUSTER. It creates two tables, T1 and T2, both with one row. Then T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file happens to get the same relfilenode that T1 had. Then we crash the server, forcing a WAL replay. After that, T2 is empty. Oops. Unfortunately I don't see any easy way to fix it. One approach would be to avoid reusing the relfilenodes until the next checkpoint, but I don't see any nice place to keep track of OIDs that have been dropped since the last checkpoint. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote: >> Second, suppose that no checkpoint has occurred during the above >> series -- although not very likely; > > That part is irrelevant. It's forced out to disk and doesn't need > recovery, with or without the checkpoint. > > There's no hole that I can see. Yes, it's really forced out. But if there's no checkpoint, recovery will begin from a point before T1 was created, and as T1 was dropped, replay will remove T2's file! > -- > Simon Riggs > 2ndQuadrant http://www.2ndQuadrant.com
Forgot to attach the script I promised. You need to set $PGDATA before running the script, and psql, pg_ctl and pg_resetxlog need to be in $PATH. After running the script, restart the postmaster and run "SELECT * FROM t2". There should be one row in the table, but it's empty. Heikki Linnakangas wrote: > Simon Riggs wrote: >> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote: >>> Second, suppose that no checkpoint has occurred during the above >>> series -- although not very likely; >> That part is irrelevant. It's forced out to disk and doesn't need >> recovery, with or without the checkpoint. >> >> There's no hole that I can see. > > No, Jacky is right. The same problem exists at least with CLUSTER, and I > think there are other commands that rely on immediate fsync as well. > > Attached is a shell script that demonstrates the problem on CVS HEAD > with CLUSTER. It creates two tables, T1 and T2, both with one row. Then > T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file > happens to get the same relfilenode that T1 had. Then we crash the > server, forcing a WAL replay. After that, T2 is empty. Oops. > > Unfortunately I don't see any easy way to fix it. One approach would be > to avoid reusing the relfilenodes until the next checkpoint, but I don't see > any nice place to keep track of OIDs that have been dropped since the last > checkpoint. > -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
I wrote: > Unfortunately I don't see any easy way to fix it. One approach would be > to avoid reusing the relfilenodes until next checkpoint, but I don't see > any nice place to keep track of OIDs that have been dropped since last > checkpoint. Ok, here's one idea: Instead of deleting the file immediately on commit of DROP TABLE, the file is truncated to release the space, but not unlink()ed, to avoid reusing that relfilenode. The truncated file can be deleted after next checkpoint. Now, how does checkpoint know what to delete? We can use the fsync request mechanism for that. When a file is truncated, a new kind of fsync request, a "deletion request", is sent to the bgwriter, which collects all such requests to a list. Before checkpoint calculates new RedoRecPtr, the list is swapped with an empty one, and after writing the new checkpoint record, all the files that were in the list are deleted. We would leak empty files on crashes, but we leak files on crashes anyway, so that shouldn't be an issue. This scheme wouldn't require catalog changes, so it would be suitable for backpatching. Any better ideas? Do we care enough about this to fix this? Enough to backpatch? The probability of this happening is pretty small, but the consequences are really bad, so my vote is "yes" and "yes". -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
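The truncate-now, unlink-after-checkpoint bookkeeping Heikki proposes can be sketched as a simulation (Python, not actual bgwriter code; the class and method names here are invented):

```python
import os
import tempfile

class Bgwriter:
    """Toy model of the proposed 'deletion request' queue."""
    def __init__(self):
        self.pending_deletes = []

    def forward_unlink(self, path):
        # On commit of DROP TABLE: truncate to release the space, but keep
        # the file around so its relfilenode cannot be reused yet.
        with open(path, "r+b") as f:
            f.truncate(0)
        self.pending_deletes.append(path)

    def checkpoint(self):
        # Swap the list *before* computing the new RedoRecPtr, so requests
        # that arrive mid-checkpoint wait for the next cycle.
        doomed, self.pending_deletes = self.pending_deletes, []
        # ... checkpoint record would be written here ...
        for path in doomed:          # safe to unlink after the checkpoint
            os.unlink(path)

bg = Bgwriter()
d = tempfile.mkdtemp()
path = os.path.join(d, "16384")
with open(path, "wb") as f:
    f.write(b"old contents")

bg.forward_unlink(path)
assert os.path.exists(path) and os.path.getsize(path) == 0  # truncated, not gone
bg.checkpoint()
assert not os.path.exists(path)                             # gone after checkpoint
```

Because the file still exists (empty) until the checkpoint, any collision check against the filesystem will refuse to hand out that relfilenode, and a replayed DROP record can no longer destroy a successor's data.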
Heikki Linnakangas wrote: > I wrote: >> Unfortunately I don't see any easy way to fix it. One approach would be >> to avoid reusing the relfilenodes until next checkpoint, but I don't see >> any nice place to keep track of OIDs that have been dropped since last >> checkpoint. > > Ok, here's one idea: > > Instead of deleting the file immediately on commit of DROP TABLE, the > file is truncated to release the space, but not unlink()ed, to avoid > reusing that relfilenode. The truncated file can be deleted after next > checkpoint. > > Now, how does checkpoint know what to delete? We can use the fsync > request mechanism for that. When a file is truncated, a new kind of > fsync request, a "deletion request", is sent to the bgwriter, which > collects all such requests to a list. Before checkpoint calculates new > RedoRecPtr, the list is swapped with an empty one, and after writing the > new checkpoint record, all the files that were in the list are deleted. > > We would leak empty files on crashes, but we leak files on crashes > anyway, so that shouldn't be an issue. This scheme wouldn't require > catalog changes, so it would be suitable for backpatching. > > Any better ideas? Couldn't we fix this by forcing a checkpoint before we commit the transaction that created the new pg_class entry for the clustered table? Or rather, more generally, before committing a transaction that created a new non-temporary relfilenode but didn't WAL-log any subsequent inserts. That's of course a rather sledgehammer-like approach to this problem - but at least for the back branches the fix would be less intrusive... regards, Florian Pflug
On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote: > >> Second, suppose that no checkpoint has occurred during the above > >> series -- although not very likely; > > > > That part is irrelevant. It's forced out to disk and doesn't need > > recovery, with or without the checkpoint. > > > > There's no hole that I can see. > > No, Jacky is right. The same problem exists at least with CLUSTER, and I > think there are other commands that rely on immediate fsync as well. > > Attached is a shell script that demonstrates the problem on CVS HEAD > with CLUSTER. It creates two tables, T1 and T2, both with one row. Then > T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file > happens to get the same relfilenode that T1 had. Then we crash the > server, forcing a WAL replay. After that, T2 is empty. Oops. > > Unfortunately I don't see any easy way to fix it. So, what you are saying is that re-using relfilenodes can cause problems during recovery in any command that alters the relfilenode of a relation? If you've got a better problem statement it would be good to get that right first before we discuss solutions. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote: >>>> Second, suppose that no checkpoint has occurred during the above >>>> series -- although not very likely; >>> That part is irrelevant. It's forced out to disk and doesn't need >>> recovery, with or without the checkpoint. >>> >>> There's no hole that I can see. >> No, Jacky is right. The same problem exists at least with CLUSTER, and I >> think there are other commands that rely on immediate fsync as well. >> >> Attached is a shell script that demonstrates the problem on CVS HEAD with >> CLUSTER. It creates two tables, T1 and T2, both with one row. Then T1 is >> dropped, and T2 is CLUSTERed, so that the new T2 relation file happens to >> get the same relfilenode that T1 had. Then we crash the server, forcing a >> WAL replay. After that, T2 is empty. Oops. >> >> Unfortunately I don't see any easy way to fix it. > > So, what you are saying is that re-using relfilenodes can cause problems > during recovery in any command that alters the relfilenode of a relation? From what I understand, I'd say that creating a relfilenode *and* subsequently inserting data without WAL-logging causes the problem. If the relfilenode was recently deleted, the inserts might be effectively undone upon recovery (because we first replay the delete), but later *not* redone (because we didn't WAL-log the inserts). That brings me to another idea for a fix that is less heavyweight than my previous checkpoint-before-commit suggestion. We could make relfilenodes globally unique if we added the xid and epoch of the creating transaction to the filename. Those are 64 bits, so if we encode them in base 36 (using A-Z,0-9), that'd increase the length of the filenames by 13. regards, Florian Pflug
Simon Riggs wrote: > If you've got a better problem statement it would be good to get that > right first before we discuss solutions. Reusing a relfilenode of a deleted relation, before next checkpoint following the commit of the deleting transaction, for an operation that doesn't WAL log the contents of the new relation, leads to data loss on recovery. Or Performing non-WAL logged operations on a relation file leads to a truncated file on recovery, if the relfilenode of that file used to belong to a relation that was dropped after the last checkpoint. Happy? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Florian G. Pflug wrote: > Heikki Linnakangas wrote: >> I wrote: >>> Unfortunately I don't see any easy way to fix it. One approach would be >>> to avoid reusing the relfilenodes until next checkpoint, but I don't see >>> any nice place to keep track of OIDs that have been dropped since last >>> checkpoint. >> >> Ok, here's one idea: >> >> Instead of deleting the file immediately on commit of DROP TABLE, the >> file is truncated to release the space, but not unlink()ed, to avoid >> reusing that relfilenode. The truncated file can be deleted after next >> checkpoint. >> >> Now, how does checkpoint know what to delete? We can use the fsync >> request mechanism for that. When a file is truncated, a new kind of >> fsync request, a "deletion request", is sent to the bgwriter, which >> collects all such requests to a list. Before checkpoint calculates new >> RedoRecPtr, the list is swapped with an empty one, and after writing the >> new checkpoint record, all the files that were in the list are deleted. >> >> We would leak empty files on crashes, but we leak files on crashes >> anyway, so that shouldn't be an issue. This scheme wouldn't require >> catalog changes, so it would be suitable for backpatching. >> >> Any better ideas? > Couldn't we fix this by forcing a checkpoint before we commit the > transaction that created the new pg_class entry for the clustered table? > Or rather, more generally, before committing a transaction that created > a new non-temporary relfilenode but didn't WAL-log any subsequent inserts. Yes, that would work. As a small optimization, you could set a flag in shared mem whenever you delete a rel file, and skip the checkpoint when that flag isn't set. > Thats of course a rather sledgehammer-like approach to this problem - > but at least for the backbranched the fix would be less intrusive... Too much of a sledgehammer IMHO. BTW, CREATE INDEX is also vulnerable. And in 8.3, COPY to a table created/truncated in the same transaction. 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > If you've got a better problem statement it would be good to get that > > right first before we discuss solutions. > > Reusing a relfilenode of a deleted relation, before next checkpoint > following the commit of the deleting transaction, for an operation that > doesn't WAL log the contents of the new relation, leads to data loss on > recovery. OK, thanks. I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't look deterministic to me, or at least isn't meant to be. GetNewObjectId() should be cycling around, so although the oid index scan using SnapshotDirty won't see committed deleted rows, that shouldn't matter for 2^32 oids. So what gives? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> If you've got a better problem statement it would be good to get that >>> right first before we discuss solutions. >> Reusing a relfilenode of a deleted relation, before next checkpoint >> following the commit of the deleting transaction, for an operation that >> doesn't WAL log the contents of the new relation, leads to data loss on >> recovery. > > OK, thanks. > > I wasn't aware we reused refilenode ids. The code in GetNewOid() doesn't > look deterministic to me, or at least isn't meant to be. > GetNewObjectId() should be cycling around, so although the oid index > scan using SnapshotDirty won't see committed deleted rows that shouldn't > matter for 2^32 oids. So what gives? I don't think you still quite understand what's happening. GetNewOid() is not interesting here, look at GetNewRelFileNode() instead. And neither are snapshots or MVCC visibility rules. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > I don't think you still quite understand what's happening. GetNewOid() > is not interesting here, look at GetNewRelFileNode() instead. And > neither are snapshots or MVCC visibility rules. Simon has a legitimate objection; not that there's no bug, but that the probability of getting bitten is exceedingly small. The test script you showed cheats six-ways-from-Sunday to cause an OID collision that would never happen in practice. The only case where it would really happen is if a table that has existed for a long time (~ 2^32 OID creations) gets dropped and then you're unlucky enough to recycle that exact OID before the next checkpoint --- and then crash before the checkpoint. I think we should think about ways to fix this, but I don't feel a need to try to backpatch a solution. I tend to agree that truncating the file, and extending the fsync request mechanism to actually delete it after the next checkpoint, is the most reasonable route to a fix. I think the objection about leaking files on crash is wrong. We'd have the replay of the deletion to fix things up --- it could probably delete the file immediately, and if not could certainly put it back on the fsync request queue. regards, tom lane
Tom Lane wrote: > Simon has a legitimate objection; not that there's no bug, but that the > probability of getting bitten is exceedingly small. Oh, if that's what he meant, he's right. > The test script you > showed cheats six-ways-from-Sunday to cause an OID collision that would > never happen in practice. The only case where it would really happen > is if a table that has existed for a long time (~ 2^32 OID creations) > gets dropped and then you're unlucky enough to recycle that exact OID > before the next checkpoint --- and then crash before the checkpoint. Yeah, it's unlikely to happen, but the consequences are horrible. Note that it's not just DROP TABLE that's a problem, but anything that uses smgrscheduleunlink, including CLUSTER and REINDEX. > I tend to agree that truncating the file, and extending the fsync > request mechanism to actually delete it after the next checkpoint, > is the most reasonable route to a fix. Ok, I'll write a patch to do that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2007-10-17 at 17:36 +0100, Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote: > >> Simon Riggs wrote: > >>> If you've got a better problem statement it would be good to get that > >>> right first before we discuss solutions. > >> Reusing a relfilenode of a deleted relation, before next checkpoint > >> following the commit of the deleting transaction, for an operation that > >> doesn't WAL log the contents of the new relation, leads to data loss on > >> recovery. > > > > OK, thanks. > > > > I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't > > look deterministic to me, or at least isn't meant to be. > > GetNewObjectId() should be cycling around, so although the oid index > > scan using SnapshotDirty won't see committed deleted rows that shouldn't > > matter for 2^32 oids. So what gives? > > I don't think you still quite understand what's happening. Clearly. It's not a problem to admit that. > GetNewOid() > is not interesting here, look at GetNewRelFileNode() instead. And > neither are snapshots or MVCC visibility rules. Which calls GetNewOid() in all cases, AFAICS. How does the reuse you say is happening come about? Seems like the bug is in the reuse, not in how we cope with potential reuse. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs wrote: > On Wed, 2007-10-17 at 17:36 +0100, Heikki Linnakangas wrote: >> Simon Riggs wrote: >>> On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote: >>>> Simon Riggs wrote: >>>>> If you've got a better problem statement it would be good to get that >>>>> right first before we discuss solutions. >>>> Reusing a relfilenode of a deleted relation, before next checkpoint >>>> following the commit of the deleting transaction, for an operation that >>>> doesn't WAL log the contents of the new relation, leads to data loss on >>>> recovery. >>> OK, thanks. >>> >>> I wasn't aware we reused refilenode ids. The code in GetNewOid() doesn't >>> look deterministic to me, or at least isn't meant to be. >>> GetNewObjectId() should be cycling around, so although the oid index >>> scan using SnapshotDirty won't see committed deleted rows that shouldn't >>> matter for 2^32 oids. So what gives? >> I don't think you still quite understand what's happening. > > Clearly. It's not a problem to admit that. > >> GetNewOid() >> is not interesting here, look at GetNewRelFileNode() instead. And >> neither are snapshots or MVCC visibility rules. > > Which calls GetNewOid() in all cases, AFAICS. > > How does the reuse you say is happening come about? Seems like the bug > is in the reuse, not in how we cope with potential reuse. After a table is dropped, the dropping transaction has been committed, and the relation file has been deleted, there's nothing preventing the reuse. There's no trace of that relfilenode in the system (except in the WAL, which we never look into except on WAL replay). There's a dead row in pg_class with that relfilenode, but even that could be vacuumed away (not that it matters because we don't examine that). Now the problem is that there's a record in the WAL to delete a relation file with that relfilenode. If that relfilenode was reused, we delete the contents of the new relation file when we replay that WAL record. 
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2007-10-17 at 18:13 +0100, Heikki Linnakangas wrote: > > The test script you > > showed cheats six-ways-from-Sunday to cause an OID collision that would > > never happen in practice. The only case where it would really happen > > is if a table that has existed for a long time (~ 2^32 OID creations) > > gets dropped and then you're unlucky enough to recycle that exact OID > > before the next checkpoint --- and then crash before the checkpoint. > > Yeah, it's unlikely to happen, but the consequences are horrible. When is this going to happen? We'd need to insert 2^32 toast chunks, which is >4 TB of data, or insert 2^32 large objects, or create 2^32 tables, or any combination of the above, all within one checkpoint duration *and* exactly hit the same relation. That's a weird and huge application, a very fast server, and an unlucky DBA hitting the exact OID to be reused, and then having the server crash before we'd ever notice. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
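Simon's ">4 TB" figure is easy to sanity-check. Assuming roughly 2 KB of user data per TOAST chunk (approximately TOAST_MAX_CHUNK_SIZE on 8 KB pages; the exact constant varies by build, so treat this as an estimate):

```python
# Wrapping the 32-bit OID counter purely with TOAST chunks, at ~2000
# bytes of data per chunk (an assumption, roughly the 8K-page chunk size):
oids = 2 ** 32
bytes_per_chunk = 2000                       # assumed chunk payload size
total_tb = oids * bytes_per_chunk / 1024 ** 4
print(f"{total_tb:.1f} TB")                  # roughly 7-8 TB, i.e. ">4 TB"
```

So a full OID cycle via TOAST alone really does imply several terabytes of inserts, supporting the "exceedingly small probability" argument; as Heikki notes next, though, only the drop and the reuse need to fall in the same checkpoint cycle.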
Simon Riggs wrote: > On Wed, 2007-10-17 at 18:13 +0100, Heikki Linnakangas wrote: > >>> The test script you >>> showed cheats six-ways-from-Sunday to cause an OID collision that would >>> never happen in practice. The only case where it would really happen >>> is if a table that has existed for a long time (~ 2^32 OID creations) >>> gets dropped and then you're unlucky enough to recycle that exact OID >>> before the next checkpoint --- and then crash before the checkpoint. >> Yeah, it's unlikely to happen, but the consequences are horrible. > > When is this going to happen? > > We'd need to insert 2^32 toast chunks, which is >4 TB of data, or insert > 2^32 large objects, or create 2^32 tables, or any combination of the > above all within one checkpoint duration *and* exactly hit the exact > same relation. The consumption of the OIDs doesn't need to happen within one checkpoint duration. As long as the DROP and the reuse happen in the same checkpoint cycle, you're screwed. Granted, you're not likely to ever experience OID wrap-around unless you have a heavily used user table with OIDs. Or create/drop temp tables a lot. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> Heikki Linnakangas <heikki@enterprisedb.com> writes: > I tend to agree that truncating the file, and extending the fsync > request mechanism to actually delete it after the next checkpoint, > is the most reasonable route to a fix. > How about just using WAL even when WAL archiving is disabled? It seems that recovery of an "XLOG_HEAP_NEWPAGE" record will do the right thing for us; look at "heap_xlog_newpage": XLogReadBuffer with init=true will extend the relation correctly and rebuild the block correctly. Someone may say that it's not worth writing xlog for operations such as copy_relation_data, but these operations shouldn't happen frequently.
Sorry, I sent the mail wrongly just now. > You need to set $PGDATA before running the script. And psql,pg_ctl and > pg_resetxlog need to be in $PATH. After running the script, restart > postmaster and run "SELECT * FROM t2". There should be one row in the > table, but it's empty. I've tried this script on "postgres (PostgreSQL) 8.3devel", and found that T2 is not empty after recovery (just as it should be) -- but the latest version acts just as you said. Then I looked at how CLUSTER is done, and found that: in "postgres (PostgreSQL) 8.3devel", unlike AlterTableSetTablespace (which copies the whole relation block-by-block and doesn't use WAL in non-archiving mode), CLUSTER copies the relation row-by-row (simple_heap_insert), which always uses WAL regardless of archiving mode. As the WAL records exist, recovery handles everything correctly. The latest version acts differently, probably because it removed the WAL-logging of CLUSTER in non-archiving mode. So the conclusion is: we can replace the WAL mechanism with smgrimmedsync only if relfilenodes are never reused; but that's not true, so we should keep the WAL. Is that right?
Jacky Leng wrote: >> I tend to agree that truncating the file, and extending the fsync >> request mechanism to actually delete it after the next checkpoint, >> is the most reasonable route to a fix. > > How about just using WAL even when WAL archiving is disabled? > It seems that recovery of an "XLOG_HEAP_NEWPAGE" record will do the > right thing for us; look at "heap_xlog_newpage": XLogReadBuffer > with init=true will extend the relation correctly and rebuild the block correctly. > > Someone may say that it's not worth writing xlog for operations > such as copy_relation_data, but these operations shouldn't happen > frequently. Always using WAL would fix the problem, but it's a big performance hit. WAL-logging doubles the amount of write I/O required. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Tom Lane wrote: >> I tend to agree that truncating the file, and extending the fsync >> request mechanism to actually delete it after the next checkpoint, >> is the most reasonable route to a fix. > > Ok, I'll write a patch to do that. There's a small problem with that: DROP TABLESPACE checks that the tablespace directory is empty, and fails if it sees one of those empty files. You also run into that problem if you 1. BEGIN; CREATE TABLE; -- no commit 2. crash+restart 3. DROP TABLESPACE because we leave behind the stale file created by CREATE TABLE. The best I can think of is to rename the obsolete file to <relfilenode>.stale when it's scheduled for deletion at the next checkpoint, check for .stale-suffixed files in GetNewRelFileNode, and delete them immediately in DropTableSpace. That still won't fix the problem with files created by a crashed transaction. For that we had a plan a long time ago: after recovery, scan the data directory for any files that don't have a live row in pg_class, and write a message to the log for each so that the DBA can delete them (deleting them automatically was considered too dangerous). That's probably 8.4 material, though. Thoughts? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
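The .stale scheme can be sketched as a toy filesystem simulation (Python, not PostgreSQL code; all function names here are invented stand-ins for the real smgr/commands code):

```python
import os
import tempfile

STALE_SUFFIX = ".stale"

def schedule_unlink(path):
    # On DROP commit: truncate and rename, so the relfilenode stays
    # "occupied" until the next checkpoint reaps the .stale file.
    with open(path, "r+b") as f:
        f.truncate(0)
    os.rename(path, path + STALE_SUFFIX)

def relfilenode_in_use(tblspc_dir, node):
    # A plain file means live; a .stale file means dead-but-unreusable.
    # Either way, a GetNewRelFileNode-style check must skip this OID.
    return (os.path.exists(os.path.join(tblspc_dir, str(node))) or
            os.path.exists(os.path.join(tblspc_dir, str(node) + STALE_SUFFIX)))

def drop_tablespace(tblspc_dir):
    # .stale files must not block DROP TABLESPACE: reap them immediately.
    for name in os.listdir(tblspc_dir):
        if name.endswith(STALE_SUFFIX):
            os.unlink(os.path.join(tblspc_dir, name))
    if os.listdir(tblspc_dir):
        raise RuntimeError("tablespace directory not empty")
    os.rmdir(tblspc_dir)

d = tempfile.mkdtemp()
with open(os.path.join(d, "16384"), "wb") as f:
    f.write(b"x")
schedule_unlink(os.path.join(d, "16384"))
assert relfilenode_in_use(d, 16384)     # still blocks relfilenode reuse...
drop_tablespace(d)                      # ...but not DROP TABLESPACE
assert not os.path.exists(d)
```

This shows why the suffix buys anything at all: the rename lets DROP TABLESPACE distinguish "scheduled for deletion" from "live" without consulting any catalog.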
Heikki Linnakangas <heikki@enterprisedb.com> writes: > The best I can think of is to rename the obsolete file to > <relfilenode>.stale, when it's scheduled for deletion at next > checkpoint, and check for .stale-suffixed files in GetNewRelFileNode, > and delete them immediately in DropTableSpace. This is getting too Rube Goldbergian for my tastes. What if we just make DROP TABLESPACE force a checkpoint before proceeding? regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> The best I can think of is to rename the obsolete file to >> <relfilenode>.stale, when it's scheduled for deletion at next >> checkpoint, and check for .stale-suffixed files in GetNewRelFileNode, >> and delete them immediately in DropTableSpace. > > This is getting too Rube Goldbergian for my tastes. What if we just > make DROP TABLESPACE force a checkpoint before proceeding? True, that would work. DROP TABLESPACE should be uncommon enough that the performance hit is ok. We only need to checkpoint if the directory isn't empty, though I think that's the case more often than not; you're most likely to drop a tablespace right after dropping all relations in it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I tend to agree that truncating the file, and extending the fsync
>> request mechanism to actually delete it after the next checkpoint,
>> is the most reasonable route to a fix.
>
> Ok, I'll write a patch to do that.

What is the argument against making relfilenodes globally unique by adding the xid and epoch of the creating transaction to the filename? Those 64 bits could be stuffed into 13 bytes by base-36 encoding (A-Z, 0-9). The maximum length of a relfilenode would then be 10 + 1 + 13 = 24 characters, which any reasonable filesystem should support IMHO.

regards, Florian Pflug
Florian G. Pflug wrote:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> I tend to agree that truncating the file, and extending the fsync
>>> request mechanism to actually delete it after the next checkpoint,
>>> is the most reasonable route to a fix.
>>
>> Ok, I'll write a patch to do that.
>
> What is the argument against making relfilenodes globally unique by
> adding the xid and epoch of the creating transaction to the filename?
> Those 64 bits could be stuffed into 13 bytes by base-36 encoding
> (A-Z, 0-9). The maximum length of a relfilenode would then be
> 10 + 1 + 13 = 24, which any reasonable filesystem should support IMHO.

The size would be xid + epoch + oid = 96 bits, not 64 bits. That would work, but it sounds like a much bigger change: sizeof(RelFileNode) would increase from 12 to 20, so any data structure that deals with RelFileNodes would take more memory, and the hash function in the buffer manager would get more expensive. I remember seeing that show up in oprofile sometimes, but it'd need to be benchmarked to see if it really matters.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
"Florian G. Pflug" <fgp@phlo.org> writes:
> What is the argument against making relfilenodes globally unique by adding the
> xid and epoch of the creating transaction to the filename?

1. Zero chance of ever backpatching. (I know I said I wasn't excited about that, but it's still a strike against a proposed fix.)

2. Adds new fields to RelFileNode, which will be a major code change, and possibly a noticeable performance hit (bigger hashtable keys).

3. Adds new columns to pg_class, which is a real PITA ...

4. Breaks oid2name and all similar code that knows about relfilenode.

			regards, tom lane
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> What is the argument against making relfilenodes globally unique by adding
>> the xid and epoch of the creating transaction to the filename?
>
> 1. Zero chance of ever backpatching. (I know I said I wasn't excited about
> that, but it's still a strike against a proposed fix.)
>
> 2. Adds new fields to RelFileNode, which will be a major code change, and
> possibly a noticeable performance hit (bigger hashtable keys).
>
> 3. Adds new columns to pg_class, which is a real PITA ...
>
> 4. Breaks oid2name and all similar code that knows about relfilenode.

Ah, OK. I was under the impression that relfilenode in pg_class is a string of some kind; in that case only GetNewRelFileNode would have needed patching. But that is obviously not the case, as I realized now :-(

Thanks for setting me straight ;-)

regards, Florian Pflug