Thread: Why copy_relation_data only use wal when WAL archiving is enabled

Why copy_relation_data only use wal when WAL archiving is enabled

From
"Jacky Leng"
Date:
If I run the database under non-archiving mode, and execute the following
command:
    alter table t set tablespace tblspc1;
Isn't it possible that the "new t" can't be recovered? 




Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Jacky Leng wrote:
> If I run the database under non-archiving mode, and execute the following 
> command:
>      alter table t set tablespace tblspc1;
> Isn't it possible that the "new t" can't be recovered? 

No. At the end of copy_relation_data we call smgrimmedsync, which fsyncs
the new relation file.
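
For reference, the tail of copy_relation_data looks roughly like this
(a paraphrased sketch, not the verbatim source):

    use_wal = XLogArchivingActive() && !rel->rd_istemp;

    nblocks = RelationGetNumberOfBlocks(rel);
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        smgrread(src, blkno, buf);

        /* WAL-log the page image only if archiving is on */
        if (use_wal)
            log_newpage(&dst->smgr_rnode, blkno, (Page) buf);

        smgrextend(dst, blkno, buf, true);
    }

    /*
     * Without WAL, the only crash protection is to force the new file
     * down to disk before commit.
     */
    if (!rel->rd_istemp)
        smgrimmedsync(dst);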

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
"Jacky Leng"
Date:
> Jacky Leng wrote:
>> If I run the database under non-archiving mode, and execute the following
>> command:
>>      alter table t set tablespace tblspc1;
>> Isn't it possible that the "new t" can't be recovered?
>
> No. At the end of copy_relation_data we call smgrimmedsync, which fsyncs
> the new relation file.

Usually it's true, but how about this situation:
* First, do the following series:
   * Create two tablespaces SPC1, SPC2;
   * Create table T1 in SPC1 and insert some values into it; suppose T1's
     oid/relfilenode is OID1;
   * Drop table T1; ---------- OID1 was released in pg_class and can be
     reused.
   * Do anything that will make the next oid allocated from pg_class be
     OID1, e.g. insert many many tuples into a relation with oids;
   * Create table T2 in SPC2 and insert some values into it; its
     oid/relfilenode is OID1;
   * Alter table T2 set tablespace SPC1; --------- T2 goes to SPC1 and uses
     the same file name as the old T1;
* Second, suppose that no checkpoint has occurred during the above
  series -- although not very likely;
* Kill the database abnormally;
* Restart the database;

Let's analyze what will happen during the recovery process:
* When T1 is re-created, it finds that its file is already
  there -- actually this file is now T2's;
* "T1"'s file (actually T2's) is re-dropped;
* ....
* T2 is re-created, finds that its file has disappeared, and re-creates
  it;
* As copy_relation_data didn't record any xlog for T2's ALTER TABLE SET
  TABLESPACE op, after recovery we'll find that T2 is empty!!!

> -- 
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com
>




Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Simon Riggs
Date:
On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
> Second, suppose that no checkpoint has occurred during the above 
> series -- although not very likely; 

That part is irrelevant. It's forced out to disk and doesn't need
recovery, with or without the checkpoint.

There's no hole that I can see.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
>> Second, suppose that no checkpoint has occurred during the above 
>> series -- although not very likely; 
> 
> That part is irrelevant. It's forced out to disk and doesn't need
> recovery, with or without the checkpoint.
> 
> There's no hole that I can see.

No, Jacky is right. The same problem exists at least with CLUSTER, and I
think there's other commands that rely on immediate fsync as well.

Attached is a shell script that demonstrates the problem on CVS HEAD
with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
happens to get the same relfilenode that T1 had. Then we crash the
server, forcing a WAL replay. After that, T2 is empty. Oops.

Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
"Jacky Leng"
Date:
> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
>> Second, suppose that no checkpoint has occurred during the above
>> series -- although not very likely;
>
> That part is irrelevant. It's forced out to disk and doesn't need
> recovery, with or without the checkpoint.
>
> There's no hole that I can see.

Yes, it's really forced out.
But if there's no checkpoint, the recovery process will begin from
a point before T1 was created, and as T1 was dropped, replay will
remove T2's file!

> -- 
>  Simon Riggs
>  2ndQuadrant  http://www.2ndQuadrant.com
>
>




Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Forgot to attach the script I promised..

You need to set $PGDATA before running the script. And psql,pg_ctl and
pg_resetxlog need to be in $PATH. After running the script, restart
postmaster and run "SELECT * FROM t2". There should be one row in the
table, but it's empty.

Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
>>> Second, suppose that no checkpoint has occurred during the above
>>> series -- although not very likely;
>> That part is irrelevant. It's forced out to disk and doesn't need
>> recovery, with or without the checkpoint.
>>
>> There's no hole that I can see.
>
> No, Jacky is right. The same problem exists at least with CLUSTER, and I
> think there's other commands that rely on immediate fsync as well.
>
> Attached is a shell script that demonstrates the problem on CVS HEAD
> with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
> T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
> happens to get the same relfilenode that T1 had. Then we crash the
> server, forcing a WAL replay. After that, T2 is empty. Oops.
>
> Unfortunately I don't see any easy way to fix it. One approach would be
> to avoid reusing the relfilenodes until next checkpoint, but I don't see
> any nice place to keep track of OIDs that have been dropped since last
> checkpoint.
>


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
I wrote:
> Unfortunately I don't see any easy way to fix it. One approach would be
> to avoid reusing the relfilenodes until next checkpoint, but I don't see
> any nice place to keep track of OIDs that have been dropped since last
> checkpoint.

Ok, here's one idea:

Instead of deleting the file immediately on commit of DROP TABLE, the
file is truncated to release the space, but not unlink()ed, to avoid
reusing that relfilenode. The truncated file can be deleted after next
checkpoint.

Now, how does checkpoint know what to delete? We can use the fsync
request mechanism for that. When a file is truncated, a new kind of
fsync request, a "deletion request", is sent to the bgwriter, which
collects all such requests to a list. Before checkpoint calculates new
RedoRecPtr, the list is swapped with an empty one, and after writing the
new checkpoint record, all the files that were in the list are deleted.
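
In rough pseudo-C, the flow would be something like this (all of the
helper names below are made up for illustration; none of them exist in
the backend today):

    /* at commit of DROP TABLE: keep the filename reserved */
    void
    smgrscheduledelete(RelFileNode rnode)
    {
        TruncateRelFileToZeroLength(rnode);   /* releases the disk space */
        ForwardDeletionRequest(rnode);        /* remembered by bgwriter  */
    }

    /* in the bgwriter, as part of a checkpoint */
    void
    DeleteStaleFilesAfterCheckpoint(void)
    {
        List *doomed = SwapOutDeletionRequests();  /* before new RedoRecPtr */

        WriteAndFlushCheckpointRecord();           /* stand-in for the real
                                                    * checkpoint code */
        UnlinkQueuedFiles(doomed);                 /* names free for reuse  */
    }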

We would leak empty files on crashes, but we leak files on crashes
anyway, so that shouldn't be an issue. This scheme wouldn't require
catalog changes, so it would be suitable for backpatching.

Any better ideas?

Do we care enough about this to fix this? Enough to backpatch? The
probability of this happening is pretty small, but the consequences are
really bad, so my vote is "yes" and "yes".

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
"Florian G. Pflug"
Date:
Heikki Linnakangas wrote:
> I wrote:
>> Unfortunately I don't see any easy way to fix it. One approach would be
>> to avoid reusing the relfilenodes until next checkpoint, but I don't see
>> any nice place to keep track of OIDs that have been dropped since last
>> checkpoint.
> 
> Ok, here's one idea:
> 
> Instead of deleting the file immediately on commit of DROP TABLE, the
> file is truncated to release the space, but not unlink()ed, to avoid
> reusing that relfilenode. The truncated file can be deleted after next
> checkpoint.
> 
> Now, how does checkpoint know what to delete? We can use the fsync
> request mechanism for that. When a file is truncated, a new kind of
> fsync request, a "deletion request", is sent to the bgwriter, which
> collects all such requests to a list. Before checkpoint calculates new
> RedoRecPtr, the list is swapped with an empty one, and after writing the
> new checkpoint record, all the files that were in the list are deleted.
> 
> We would leak empty files on crashes, but we leak files on crashes
> anyway, so that shouldn't be an issue. This scheme wouldn't require
> catalog changes, so it would be suitable for backpatching.
> 
> Any better ideas?
Couldn't we fix this by forcing a checkpoint before we commit the transaction 
that created the new pg_class entry for the clustered table? Or rather, more 
generally, before committing a transaction that created a new non-temporary 
relfilenode but didn't WAL-log any subsequent inserts.
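
Roughly (a hypothetical sketch only; "madeNewUnloggedRelFileNode" is an
imaginary per-backend flag, and the exact RequestCheckpoint arguments are
glossed over):

    /* set wherever we create a relfilenode and then skip WAL for its data */
    static bool madeNewUnloggedRelFileNode = false;

    /* called from CommitTransaction, before the commit record is written */
    static void
    AtCommit_MaybeForceCheckpoint(void)
    {
        if (madeNewUnloggedRelFileNode)
        {
            RequestCheckpoint(/* wait = */ true, /* warnontime = */ false);
            madeNewUnloggedRelFileNode = false;
        }
    }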

That's of course a rather sledgehammer-like approach to this problem - but at 
least for the back branches the fix would be less intrusive...

regards, Florian Pflug



Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Simon Riggs
Date:
On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
> >> Second, suppose that no checkpoint has occurred during the above 
> >> series -- although not very likely; 
> > 
> > That part is irrelevant. It's forced out to disk and doesn't need
> > recovery, with or without the checkpoint.
> > 
> > There's no hole that I can see.
> 
> No, Jacky is right. The same problem exists at least with CLUSTER, and I
> think there's other commands that rely on immediate fsync as well.
> 
> Attached is a shell script that demonstrates the problem on CVS HEAD
> with CLUSTER. It creates two tables, T1 and T2, both with one row. Then
> T1 is dropped, and T2 is CLUSTERed, so that the new T2 relation file
> happens to get the same relfilenode that T1 had. Then we crash the
> server, forcing a WAL replay. After that, T2 is empty. Oops.
> 
> Unfortunately I don't see any easy way to fix it. 

So, what you are saying is that re-using relfilenodes can cause problems
during recovery in any command that alters the relfilenode of a
relation?

If you've got a better problem statement it would be good to get that
right first before we discuss solutions.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
"Florian G. Pflug"
Date:
Simon Riggs wrote:
> On Wed, 2007-10-17 at 12:11 +0100, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Wed, 2007-10-17 at 17:18 +0800, Jacky Leng wrote:
>>>> Second, suppose that no checkpoint has occurred during the above 
>>>> series -- although not very likely;
>>> That part is irrelevant. It's forced out to disk and doesn't need 
>>> recovery, with or without the checkpoint.
>>> 
>>> There's no hole that I can see.
>> No, Jacky is right. The same problem exists at least with CLUSTER, and I 
>> think there's other commands that rely on immediate fsync as well.
>> 
>> Attached is a shell script that demonstrates the problem on CVS HEAD with
>> CLUSTER. It creates two tables, T1 and T2, both with one row. Then T1 is
>> dropped, and T2 is CLUSTERed, so that the new T2 relation file happens to
>> get the same relfilenode that T1 had. Then we crash the server, forcing a
>> WAL replay. After that, T2 is empty. Oops.
>> 
>> Unfortunately I don't see any easy way to fix it.
> 
> So, what you are saying is that re-using relfilenodes can cause problems 
> during recovery in any command that alters the relfilenode of a relation?

From what I understand, I'd say that creating a relfilenode *and* subsequently
inserting data without WAL-logging causes the problem. If the relfilenode was
recently deleted, the inserts might be effectively undone upon recovery (because
we first replay the delete), but later *not* redone (because we didn't WAL-log
the inserts).

That brings me to another idea for a fix that is less heavyweight than my
previous checkpoint-before-commit suggestion.

We could make relfilenodes globally unique if we added the xid and epoch of the
creating transaction to the filename. Those are 64 bits, so if we encode them
in base 36 (using A-Z,0-9), that'd increase the length of the filenames by 13.
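
Just to illustrate the length arithmetic, a self-contained base-36 encoder
(not backend code; 36^13 > 2^64, so 13 digits always suffice):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Encode a 64-bit value using the digits 0-9,A-Z. */
    static void
    encode_base36(uint64_t v, char *buf)
    {
        static const char digits[] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        char    tmp[14];
        int     i = 0;

        do
        {
            tmp[i++] = digits[v % 36];
            v /= 36;
        } while (v > 0);

        while (i > 0)               /* reverse into the output buffer */
            *buf++ = tmp[--i];
        *buf = '\0';
    }

    int
    main(void)
    {
        char    buf[14];

        encode_base36(UINT64_MAX, buf);
        printf("%s (%zu chars)\n", buf, strlen(buf));    /* 13 chars */
        return 0;
    }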

regards, Florian Pflug



Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> If you've got a better problem statement it would be good to get that
> right first before we discuss solutions.

Reusing a relfilenode of a deleted relation, before next checkpoint
following the commit of the deleting transaction, for an operation that
doesn't WAL log the contents of the new relation, leads to data loss on
recovery.

Or

Performing non-WAL logged operations on a relation file leads to a
truncated file on recovery, if the relfilenode of that file used to
belong to a relation that was dropped after the last checkpoint.

Happy?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Florian G. Pflug wrote:
> Heikki Linnakangas wrote:
>> I wrote:
>>> Unfortunately I don't see any easy way to fix it. One approach would be
>>> to avoid reusing the relfilenodes until next checkpoint, but I don't see
>>> any nice place to keep track of OIDs that have been dropped since last
>>> checkpoint.
>>
>> Ok, here's one idea:
>>
>> Instead of deleting the file immediately on commit of DROP TABLE, the
>> file is truncated to release the space, but not unlink()ed, to avoid
>> reusing that relfilenode. The truncated file can be deleted after next
>> checkpoint.
>>
>> Now, how does checkpoint know what to delete? We can use the fsync
>> request mechanism for that. When a file is truncated, a new kind of
>> fsync request, a "deletion request", is sent to the bgwriter, which
>> collects all such requests to a list. Before checkpoint calculates new
>> RedoRecPtr, the list is swapped with an empty one, and after writing the
>> new checkpoint record, all the files that were in the list are deleted.
>>
>> We would leak empty files on crashes, but we leak files on crashes
>> anyway, so that shouldn't be an issue. This scheme wouldn't require
>> catalog changes, so it would be suitable for backpatching.
>>
>> Any better ideas?
> Couldn't we fix this by forcing a checkpoint before we commit the
> transaction that created the new pg_class entry for the clustered table?
> Or rather, more generally, before committing a transaction that created
> a new non-temporary relfilenode but didn't WAL-log any subsequent inserts.

Yes, that would work. As a small optimization, you could set a flag in
shared mem whenever you delete a rel file, and skip the checkpoint when
that flag isn't set.

> That's of course a rather sledgehammer-like approach to this problem -
> but at least for the back branches the fix would be less intrusive...

Too much of a sledgehammer IMHO.

BTW, CREATE INDEX is also vulnerable. And in 8.3, COPY to a table
created/truncated in the same transaction.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Simon Riggs
Date:
On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > If you've got a better problem statement it would be good to get that
> > right first before we discuss solutions.
> 
> Reusing a relfilenode of a deleted relation, before next checkpoint
> following the commit of the deleting transaction, for an operation that
> doesn't WAL log the contents of the new relation, leads to data loss on
> recovery.

OK, thanks. 

I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't
look deterministic to me, or at least isn't meant to be.
GetNewObjectId() should be cycling around, so although the oid index
scan using SnapshotDirty won't see committed deleted rows that shouldn't
matter for 2^32 oids. So what gives?

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> If you've got a better problem statement it would be good to get that
>>> right first before we discuss solutions.
>> Reusing a relfilenode of a deleted relation, before next checkpoint
>> following the commit of the deleting transaction, for an operation that
>> doesn't WAL log the contents of the new relation, leads to data loss on
>> recovery.
> 
> OK, thanks. 
> 
> I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't
> look deterministic to me, or at least isn't meant to be.
> GetNewObjectId() should be cycling around, so although the oid index
> scan using SnapshotDirty won't see committed deleted rows that shouldn't
> matter for 2^32 oids. So what gives?

I still don't think you quite understand what's happening. GetNewOid()
is not interesting here, look at GetNewRelFileNode() instead. And
neither are snapshots or MVCC visibility rules.
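
Roughly, GetNewRelFileNode() does this (paraphrased from memory; the real
code probes for the file by opening it rather than using access(), and
handles the bootstrap case too):

    do
    {
        rnode.relNode = GetNewOid(pg_class);

        /* Check for an existing file of the same name */
        rpath = relpath(rnode);
        collides = (access(rpath, F_OK) == 0);
        pfree(rpath);
    } while (collides);

Once the old file has been unlinked there is nothing left on disk to
collide with, even though the WAL still contains the record that will
unlink that same filename again during replay.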

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I don't think you still quite understand what's happening. GetNewOid()
> is not interesting here, look at GetNewRelFileNode() instead. And
> neither are snapshots or MVCC visibility rules.

Simon has a legitimate objection; not that there's no bug, but that the
probability of getting bitten is exceedingly small.  The test script you
showed cheats six-ways-from-Sunday to cause an OID collision that would
never happen in practice.  The only case where it would really happen
is if a table that has existed for a long time (~ 2^32 OID creations)
gets dropped and then you're unlucky enough to recycle that exact OID
before the next checkpoint --- and then crash before the checkpoint.

I think we should think about ways to fix this, but I don't feel a need
to try to backpatch a solution.

I tend to agree that truncating the file, and extending the fsync
request mechanism to actually delete it after the next checkpoint,
is the most reasonable route to a fix.

I think the objection about leaking files on crash is wrong. We'd
have the replay of the deletion to fix things up --- it could probably
delete the file immediately, and if not could certainly put it back
on the fsync request queue.
        regards, tom lane


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Simon has a legitimate objection; not that there's no bug, but that the
> probability of getting bitten is exceedingly small.  

Oh, if that's what he meant, he's right.

> The test script you
> showed cheats six-ways-from-Sunday to cause an OID collision that would
> never happen in practice.  The only case where it would really happen
> is if a table that has existed for a long time (~ 2^32 OID creations)
> gets dropped and then you're unlucky enough to recycle that exact OID
> before the next checkpoint --- and then crash before the checkpoint.

Yeah, it's unlikely to happen, but the consequences are horrible.

Note that it's not just DROP TABLE that's a problem, but anything that
uses smgrscheduleunlink, including CLUSTER and REINDEX.

> I tend to agree that truncating the file, and extending the fsync
> request mechanism to actually delete it after the next checkpoint,
> is the most reasonable route to a fix.

Ok, I'll write a patch to do that.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Simon Riggs
Date:
On Wed, 2007-10-17 at 17:36 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
> >> Simon Riggs wrote:
> >>> If you've got a better problem statement it would be good to get that
> >>> right first before we discuss solutions.
> >> Reusing a relfilenode of a deleted relation, before next checkpoint
> >> following the commit of the deleting transaction, for an operation that
> >> doesn't WAL log the contents of the new relation, leads to data loss on
> >> recovery.
> > 
> > OK, thanks. 
> > 
> > I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't
> > look deterministic to me, or at least isn't meant to be.
> > GetNewObjectId() should be cycling around, so although the oid index
> > scan using SnapshotDirty won't see committed deleted rows that shouldn't
> > matter for 2^32 oids. So what gives?
> 
> I still don't think you quite understand what's happening. 

Clearly. It's not a problem to admit that.

> GetNewOid()
> is not interesting here, look at GetNewRelFileNode() instead. And
> neither are snapshots or MVCC visibility rules.

Which calls GetNewOid() in all cases, AFAICS.

How does the reuse you say is happening come about? Seems like the bug
is in the reuse, not in how we cope with potential reuse.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2007-10-17 at 17:36 +0100, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Wed, 2007-10-17 at 15:02 +0100, Heikki Linnakangas wrote:
>>>> Simon Riggs wrote:
>>>>> If you've got a better problem statement it would be good to get that
>>>>> right first before we discuss solutions.
>>>> Reusing a relfilenode of a deleted relation, before next checkpoint
>>>> following the commit of the deleting transaction, for an operation that
>>>> doesn't WAL log the contents of the new relation, leads to data loss on
>>>> recovery.
>>> OK, thanks. 
>>>
>>> I wasn't aware we reused relfilenode ids. The code in GetNewOid() doesn't
>>> look deterministic to me, or at least isn't meant to be.
>>> GetNewObjectId() should be cycling around, so although the oid index
>>> scan using SnapshotDirty won't see committed deleted rows that shouldn't
>>> matter for 2^32 oids. So what gives?
>> I still don't think you quite understand what's happening. 
> 
> Clearly. It's not a problem to admit that.
> 
>> GetNewOid()
>> is not interesting here, look at GetNewRelFileNode() instead. And
>> neither are snapshots or MVCC visibility rules.
> 
> Which calls GetNewOid() in all cases, AFAICS.
> 
> How does the reuse you say is happening come about? Seems like the bug
> is in the reuse, not in how we cope with potential reuse.

After a table is dropped, the dropping transaction has been committed,
and the relation file has been deleted, there's nothing preventing the
reuse. There's no trace of that relfilenode in the system (except in the
WAL, which we never look into except on WAL replay). There's a dead row
in pg_class with that relfilenode, but even that could be vacuumed away
(not that it matters because we don't examine that).

Now the problem is that there's a record in the WAL to delete a relation
file with that relfilenode. If that relfilenode was reused, we delete
the contents of the new relation file when we replay that WAL record.
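
Roughly, the commit-record replay does this (paraphrased from memory, not
the verbatim source):

    /* xact_redo_commit(): make sure files queued for deletion are gone */
    for (i = 0; i < xlrec->nrels; i++)
    {
        XLogDropRelation(xlrec->xnodes[i]);      /* forget xlog-side state */
        smgrdounlink(smgropen(xlrec->xnodes[i]),
                     false,      /* isTemp */
                     true);      /* isRedo: a missing file is OK */
    }

If xnodes[i] has been reused by then, that unlink hits the new file.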

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Simon Riggs
Date:
On Wed, 2007-10-17 at 18:13 +0100, Heikki Linnakangas wrote:

> > The test script you
> > showed cheats six-ways-from-Sunday to cause an OID collision that would
> > never happen in practice.  The only case where it would really happen
> > is if a table that has existed for a long time (~ 2^32 OID creations)
> > gets dropped and then you're unlucky enough to recycle that exact OID
> > before the next checkpoint --- and then crash before the checkpoint.
> 
> Yeah, it's unlikely to happen, but the consequences are horrible.

When is this going to happen?

We'd need to insert 2^32 toast chunks, which is >4 TB of data, or insert
2^32 large objects, or create 2^32 tables, or any combination of the
above all within one checkpoint duration *and* exactly hit the exact
same relation.

That's a weird and huge application, a very fast server, and an unlucky
DBA hitting the exact OID to be reused, and then the server crashing
before we'd ever notice.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com



Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Wed, 2007-10-17 at 18:13 +0100, Heikki Linnakangas wrote:
> 
>>> The test script you
>>> showed cheats six-ways-from-Sunday to cause an OID collision that would
>>> never happen in practice.  The only case where it would really happen
>>> is if a table that has existed for a long time (~ 2^32 OID creations)
>>> gets dropped and then you're unlucky enough to recycle that exact OID
>>> before the next checkpoint --- and then crash before the checkpoint.
>> Yeah, it's unlikely to happen, but the consequences are horrible.
> 
> When is this going to happen?
> 
> We'd need to insert 2^32 toast chunks, which is >4 TB of data, or insert
> 2^32 large objects, or create 2^32 tables, or any combination of the
> above all within one checkpoint duration *and* exactly hit the exact
> same relation.

The consumption of the OIDs doesn't need to happen within one checkpoint
duration. As long as the DROP and the reuse happen in the same
checkpoint cycle, you're screwed.

Granted, you're not likely to ever experience OID wrap-around unless
you have a heavily used user table with OIDs, or create/drop temp tables
a lot.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
"Jacky Leng"
Date:
> You need to set $PGDATA before running the script. And psql,pg_ctl and
> pg_resetxlog need to be in $PATH. After running the script, restart
> postmaster and run "SELECT * FROM t2". There should be one row in the
> table, but it's empty.

I've tried this script, and surprisingly found that T2 is not empty, just 
as it should be.
Then I looked at how CLUSTER is done, and found that




Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
"Jacky Leng"
Date:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I tend to agree that truncating the file, and extending the fsync
> request mechanism to actually delete it after the next checkpoint,
> is the most reasonable route to a fix.
>

How about just allowing WAL to be used even when WAL archiving is disabled?
It seems that recovery of an "XLOG_HEAP_NEWPAGE" record will do the
right thing for us; look at "heap_xlog_newpage": XLogReadBuffer
with init=true will extend the block and rebuild it correctly.

Someone may say that it's not worth writing xlog for operations
such as copy_relation_data, but these operations shouldn't happen
frequently. 
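
For reference, heap_xlog_newpage is roughly (paraphrased, not the verbatim
source):

    buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);   /* init */
    page = (Page) BufferGetPage(buffer);

    /* the record carries a full page image; just copy it in */
    memcpy(page, (char *) xlrec + SizeOfHeapNewpage, BLCKSZ);

    PageSetLSN(page, lsn);
    MarkBufferDirty(buffer);
    UnlockReleaseBuffer(buffer);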




Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
"Jacky Leng"
Date:
Sorry, I sent that mail by mistake just now.

> You need to set $PGDATA before running the script. And psql,pg_ctl and
> pg_resetxlog need to be in $PATH. After running the script, restart
> postmaster and run "SELECT * FROM t2". There should be one row in the
> table, but it's empty.

I've tried this script on "postgres (PostgreSQL) 8.3devel", and found that
T2 is not empty after recovery (just as it should be) --- but the latest
version acts just like you said.

Then I looked at how CLUSTER is done, and found that:
In "postgres (PostgreSQL) 8.3devel", unlike AlterTableSetTablespace (which
copies the whole relation block-by-block and doesn't use WAL in
non-archiving mode), CLUSTER copies the relation row-by-row
(simple_heap_insert), which always uses WAL regardless of archiving mode.
As the WAL exists, recovery copes with everything correctly.
The latest version acts differently, probably because it removes the WAL
logging of CLUSTER in non-archiving mode.

So the conclusion is: we can replace the WAL mechanism with smgrimmedsync
only if relfilenodes are guaranteed not to be reused; but that's not
guaranteed, so we should keep the WAL.

Is it right? 




Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Jacky Leng wrote:
>> I tend to agree that truncating the file, and extending the fsync
>> request mechanism to actually delete it after the next checkpoint,
>> is the most reasonable route to a fix.
> 
> How about just allowing WAL to be used even when WAL archiving is disabled?
> It seems that recovery of an "XLOG_HEAP_NEWPAGE" record will do the
> right thing for us; look at "heap_xlog_newpage": XLogReadBuffer
> with init=true will extend the block and rebuild it correctly.
> 
> Someone may say that it's not worth writing xlog for operations
> such as copy_relation_data, but these operations shouldn't happen
> frequently. 

Always using WAL would fix the problem, but it's a big performance hit.
WAL-logging doubles the amount of write I/O required.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I tend to agree that truncating the file, and extending the fsync
>> request mechanism to actually delete it after the next checkpoint,
>> is the most reasonable route to a fix.
> 
> Ok, I'll write a patch to do that.

There's a small problem with that: DROP TABLESPACE checks that the
tablespace directory is empty, and fails if it sees one of those empty
files. You also run into that problem if you

1. BEGIN; CREATE TABLE; -- no commit
2. crash+restart
3. DROP TABLESPACE

because we leave behind the stale file created by CREATE TABLE.

The best I can think of is to rename the obsolete file to
<relfilenode>.stale, when it's scheduled for deletion at next
checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
and delete them immediately in DropTableSpace.

That still won't fix the problem with files created by a crashed
transaction. For that we had a plan a long time ago: after recovery,
scan the data directory for any files that don't have a live row in
pg_class, and write a message to the log for each so that the DBA can
delete them
(deleting them automatically was considered too dangerous). That's
probably 8.4 material, though.

Thoughts?

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> The best I can think of is to rename the obsolete file to
> <relfilenode>.stale, when it's scheduled for deletion at next
> checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
> and delete them immediately in DropTableSpace.

This is getting too Rube Goldbergian for my tastes.  What if we just
make DROP TABLESPACE force a checkpoint before proceeding?
        regards, tom lane


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> The best I can think of is to rename the obsolete file to
>> <relfilenode>.stale, when it's scheduled for deletion at next
>> checkpoint, and check for .stale-suffixed files in GetNewRelFileNode,
>> and delete them immediately in DropTableSpace.
> 
> This is getting too Rube Goldbergian for my tastes.  What if we just
> make DROP TABLESPACE force a checkpoint before proceeding?

True, that would work. DROP TABLESPACE should be uncommon enough that
the performance hit is ok. We only need to checkpoint if the directory
isn't empty, though I think that's the case more often than not; you're
most likely to drop a tablespace right after dropping all relations in it.
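
I.e. something along these lines near the top of DropTableSpace (a sketch
only; the helper and error wording follow my memory of tablespace.c, and
the RequestCheckpoint arguments are glossed over):

    if (!directory_is_empty(location))
    {
        /* flush pending deletions, then look again */
        RequestCheckpoint(/* wait = */ true, /* warnontime = */ false);

        if (!directory_is_empty(location))
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("tablespace \"%s\" is not empty",
                            tablespacename)));
    }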

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
"Florian G. Pflug"
Date:
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I tend to agree that truncating the file, and extending the fsync
>> request mechanism to actually delete it after the next checkpoint,
>> is the most reasonable route to a fix.
> 
> Ok, I'll write a patch to do that.

What is the argument against making relfilenodes globally unique by adding the 
xid and epoch of the creating transaction to the filename? Those 64 bits could 
be stuffed into 13 bytes by base-36 encoding (A-Z,0-9). The maximum length of a 
relfilenode would then be 10 + 1 + 13 = 24, which any reasonable filesystem 
should support IMHO.

regards, Florian Pflug




Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Heikki Linnakangas
Date:
Florian G. Pflug wrote:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> I tend to agree that truncating the file, and extending the fsync
>>> request mechanism to actually delete it after the next checkpoint,
>>> is the most reasonable route to a fix.
>>
>> Ok, I'll write a patch to do that.
> 
> What is the argument against making relfilenodes globally unique by
> adding the xid and epoch of the creating transaction to the filename?
> Those 64 bits could be stuffed into 13 bytes by base-36 encoding
> (A-Z,0-9). The maximum length of a relfilenode would then be 10 + 1 + 13
> = 24, which any reasonable filesystem should support IMHO.

The size would be xid + epoch + oid = 96 bits, not 64 bits.

That would work, but sounds like a much bigger change.

sizeof(RelFileNode) would increase from 12 to 20, so any data structure
that deals with RelFileNodes would take more memory. Hash function in
buffer manager would get more expensive. I remember seeing that showing
up in oprofile sometimes, but it'd need to be benchmarked to see if it
really matters.
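
For reference, the structs involved (the second one is purely illustrative,
not a concrete proposal):

    /* as it is today: 12 bytes */
    typedef struct RelFileNode
    {
        Oid         spcNode;        /* tablespace */
        Oid         dbNode;         /* database */
        Oid         relNode;        /* relation */
    } RelFileNode;

    /* hypothetical globally-unique variant: 20 bytes */
    typedef struct UniqueRelFileNode
    {
        RelFileNode rnode;
        TransactionId createXid;    /* xid of creating transaction */
        uint32      createEpoch;    /* xid epoch at creation */
    } UniqueRelFileNode;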

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
Tom Lane
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:
> What is the argument against making relfilenodes globally unique by adding the 
> xid and epoch of the creating transaction to the filename?

1. Zero chance of ever backpatching.  (I know I said I wasn't excited
   about that, but it's still a strike against a proposed fix.)

2. Adds new fields to RelFileNode, which will be a major code change,
   and possibly a noticeable performance hit (bigger hashtable keys).

3. Adds new columns to pg_class, which is a real PITA ...

4. Breaks oid2name and all similar code that knows about relfilenode.

        regards, tom lane


Re: Why copy_relation_data only use wal when WAL archiving is enabled

From
"Florian G. Pflug"
Date:
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> What is the argument against making relfilenodes globally unique by adding
>> the xid and epoch of the creating transaction to the filename?
> 
> 1. Zero chance of ever backpatching.  (I know I said I wasn't excited about
> that, but it's still a strike against a proposed fix.)
> 
> 2. Adds new fields to RelFileNode, which will be a major code change, and
> possibly a noticeable performance hit (bigger hashtable keys).
> 
> 3. Adds new columns to pg_class, which is a real PITA ...
> 
> 4. Breaks oid2name and all similar code that knows about relfilenode.

Ah, Ok. I was under the impression that relfilenode in pg_class is a string of
some kind. In that case only GetNewRelFileNode would have needed patching...
But that is obviously not the case, as I realized now :-(

Thanks for setting me straight ;-)

regards, Florian Pflug