Thread: RelationCreateStorage can orphan files

From
Robert Haas
Date:
I notice that RelationCreateStorage() creates the main fork on disk
before writing (let alone flushing) WAL.  So if PG gets killed at that
point, we end up with an orphaned file on disk.  I think that we could
even extend the relation a few times before WAL gets written, so I
don't even think it's necessarily a zero-size file.  We could perhaps
avoid this by writing and flushing a WAL record that includes the
creating XID before touching the disk; when we replay the record, we
create the file but then delete it if the XID fails to commit before
recovery ends.  But I guess maybe our feeling is that it's just not
worth taking a performance hit for this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
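
The replay rule proposed in the message above (create the file when the record is replayed, then delete it if the creating XID has not committed by the end of recovery) can be sketched as a toy model. This is only an illustration under stated assumptions: every type and function name below (ToyXid, ToyRecovery, toy_redo_create, toy_redo_commit, toy_end_of_recovery) is hypothetical and is not a PostgreSQL API, and the "files" are just flags in memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

typedef unsigned int ToyXid;        /* hypothetical stand-in for a transaction ID */

typedef struct {
    ToyXid creating_xid;            /* transaction that created the file */
    bool   file_exists;             /* toy stand-in for the file on disk */
} ToyPendingCreate;

typedef struct {
    ToyPendingCreate pending[MAX_PENDING];
    size_t           npending;      /* creates whose XID has not committed yet */
} ToyRecovery;

/* Replay a "create" record: make the file and remember who created it. */
bool toy_redo_create(ToyRecovery *r, ToyXid xid)
{
    assert(r != NULL);
    if (r->npending >= MAX_PENDING)
        return false;
    r->pending[r->npending].creating_xid = xid;
    r->pending[r->npending].file_exists = true;
    r->npending++;
    return true;
}

/* Replay a commit record: files created by xid are permanent, so forget them. */
void toy_redo_commit(ToyRecovery *r, ToyXid xid)
{
    size_t kept = 0;

    assert(r != NULL);
    for (size_t i = 0; i < r->npending; i++)
        if (r->pending[i].creating_xid != xid)
            r->pending[kept++] = r->pending[i];
    r->npending = kept;
}

/* End of recovery: remove every file whose creator never committed.
 * Returns the number of files removed. */
size_t toy_end_of_recovery(ToyRecovery *r)
{
    size_t removed = 0;

    assert(r != NULL);
    for (size_t i = 0; i < r->npending; i++) {
        if (r->pending[i].file_exists) {
            r->pending[i].file_exists = false;
            removed++;
        }
    }
    r->npending = 0;
    return removed;
}
```

The model only covers the bookkeeping during replay. It says nothing about the cost of flushing the extra WAL record before touching the disk, which is the performance hit the message asks about.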


Re: RelationCreateStorage can orphan files

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I notice that RelationCreateStorage() creates the main fork on disk
> before writing (let alone flushing) WAL.  So if PG gets killed at that
> point, we end up with an orphaned file on disk.  I think that we could
> even extend the relation a few times before WAL gets written, so I
> don't even think it's necessarily a zero-size file.  We could perhaps
> avoid this by writing and flushing a WAL record that includes the
> creating XID before touching the disk; when we replay the record, we
> create the file but then delete it if the XID fails to commit before
> recovery ends.  But I guess maybe our feeling is that it's just not
> worth taking a performance hit for this?

That design is intentional.  If the file create fails, and you've
already written a WAL record that says you created it, you are flat
out screwed.  You can't even PANIC --- if you do, then the replay of
the WAL record will likely fail and PANIC again, leaving the database
dead in the water.

Orphaned files, in contrast, are completely non-dangerous --- the worst
they can do is waste a little bit of disk space.  That's a cheap price
to pay for not having an unrecoverable database after a create failure.

This is essentially the same reason why CREATE DATABASE and related
commands xlog directory copy operations only after completing them.
That potentially wastes much more than a few blocks; but it's still
non-dangerous, and far safer than the alternative.
        regards, tom lane
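
The ordering argument in the message above can be illustrated with a small in-memory model. It is a sketch under stated assumptions, not the actual storage or WAL code: ToyCluster, toy_create_file, toy_replay, create_then_log and log_then_create are all hypothetical names, the "disk" and "WAL" are plain flags, and replay is assumed to simply retry the create.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model; none of these names are PostgreSQL APIs. */
typedef struct {
    bool file_exists;     /* is the relation file on "disk"? */
    bool wal_has_create;  /* does the "WAL" hold a create record? */
    bool disk_broken;     /* does file creation fail every time? */
} ToyCluster;

typedef enum {
    OUTCOME_CLEAN,        /* nothing left behind, or create fully logged */
    OUTCOME_ORPHAN,       /* file exists but WAL never mentions it */
    OUTCOME_REPLAY_FAILS  /* WAL demands a create that cannot be redone */
} Outcome;

bool toy_create_file(ToyCluster *c)
{
    assert(c != NULL);
    if (c->disk_broken)
        return false;
    c->file_exists = true;
    return true;
}

/* Recovery: redo the logged create if the file is missing. */
bool toy_replay(ToyCluster *c)
{
    assert(c != NULL);
    if (c->wal_has_create && !c->file_exists)
        return toy_create_file(c);
    return true;
}

/* Ordering described as intentional: create the file, then log it.
 * crash_between simulates a kill after the create but before the WAL write. */
Outcome create_then_log(ToyCluster *c, bool crash_between)
{
    if (!toy_create_file(c))
        return OUTCOME_CLEAN;   /* ordinary error; nothing logged, nothing to replay */
    if (crash_between)
        return toy_replay(c) ? OUTCOME_ORPHAN : OUTCOME_REPLAY_FAILS;
    c->wal_has_create = true;
    return OUTCOME_CLEAN;
}

/* Alternative ordering: log first, then create the file. */
Outcome log_then_create(ToyCluster *c)
{
    c->wal_has_create = true;
    if (!toy_create_file(c))
        /* The WAL now claims a file that cannot be made; replay hits the same failure. */
        return toy_replay(c) ? OUTCOME_CLEAN : OUTCOME_REPLAY_FAILS;
    return OUTCOME_CLEAN;
}
```

In this model the worst case for create-then-log is OUTCOME_ORPHAN, where recovery still succeeds, while log-then-create on a failing disk ends in OUTCOME_REPLAY_FAILS, the "PANIC again" loop the message describes.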


Re: RelationCreateStorage can orphan files

From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 9:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I notice that RelationCreateStorage() creates the main fork on disk
>> before writing (let alone flushing) WAL.  So if PG gets killed at that
>> point, we end up with an orphaned file on disk.  I think that we could
>> even extend the relation a few times before WAL gets written, so I
>> don't even think it's necessarily a zero-size file.  We could perhaps
>> avoid this by writing and flushing a WAL record that includes the
>> creating XID before touching the disk; when we replay the record, we
>> create the file but then delete it if the XID fails to commit before
>> recovery ends.  But I guess maybe our feeling is that it's just not
>> worth taking a performance hit for this?
>
> That design is intentional.  If the file create fails, and you've
> already written a WAL record that says you created it, you are flat
> out screwed.  You can't even PANIC --- if you do, then the replay of
> the WAL record will likely fail and PANIC again, leaving the database
> dead in the water.

Not that this is perhaps more than of academic interest, but could you
get around this problem by making the replay of the XLOG record defer
the creation of the file until such time as it's actually written to
or the creating XID commits?  And also, if the XID does not commit,
going back and trying to remove the file (on a best effort basis)?

> Orphaned files, in contrast, are completely non-dangerous --- the worst
> they can do is waste a little bit of disk space.  That's a cheap price
> to pay for not having an unrecoverable database after a create failure.
>
> This is essentially the same reason why CREATE DATABASE and related
> commands xlog directory copy operations only after completing them.
> That potentially wastes much more than a few blocks; but it's still
> non-dangerous, and far safer than the alternative.

Thanks for the explanation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: RelationCreateStorage can orphan files

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Sep 15, 2010 at 9:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That design is intentional.  If the file create fails, and you've
>> already written a WAL record that says you created it, you are flat
>> out screwed.  You can't even PANIC --- if you do, then the replay of
>> the WAL record will likely fail and PANIC again, leaving the database
>> dead in the water.

> Not that this is perhaps more than of academic interest, but could you
> get around this problem by making the replay of the XLOG record defer
> the creation of the file until such time as it's actually written to
> or the creating XID commits?  And also, if the XID does not commit,
> going back and trying to remove the file (on a best effort basis)?

Perhaps, but it seems like a lot more complexity than is justified
by the problem.
        regards, tom lane


Re: RelationCreateStorage can orphan files

From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 10:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Sep 15, 2010 at 9:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> That design is intentional.  If the file create fails, and you've
>>> already written a WAL record that says you created it, you are flat
>>> out screwed.  You can't even PANIC --- if you do, then the replay of
>>> the WAL record will likely fail and PANIC again, leaving the database
>>> dead in the water.
>
>> Not that this is perhaps more than of academic interest, but could you
>> get around this problem by making the replay of the XLOG record defer
>> the creation of the file until such time as it's actually written to
>> or the creating XID commits?  And also, if the XID does not commit,
>> going back and trying to remove the file (on a best effort basis)?
>
> Perhaps, but it seems like a lot more complexity than is justified
> by the problem.

That's sort of what I figured.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: RelationCreateStorage can orphan files

From
Bruce Momjian
Date:
Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I notice that RelationCreateStorage() creates the main fork on disk
> > before writing (let alone flushing) WAL.  So if PG gets killed at that
> > point, we end up with an orphaned file on disk.  I think that we could
> > even extend the relation a few times before WAL gets written, so I
> > don't even think it's necessarily a zero-size file.  We could perhaps
> > avoid this by writing and flushing a WAL record that includes the
> > creating XID before touching the disk; when we replay the record, we
> > create the file but then delete it if the XID fails to commit before
> > recovery ends.  But I guess maybe our feeling is that it's just not
> > worth taking a performance hit for this?
> 
> That design is intentional.  If the file create fails, and you've
> already written a WAL record that says you created it, you are flat
> out screwed.  You can't even PANIC --- if you do, then the replay of
> the WAL record will likely fail and PANIC again, leaving the database
> dead in the water.
> 
> Orphaned files, in contrast, are completely non-dangerous --- the worst
> they can do is waste a little bit of disk space.  That's a cheap price
> to pay for not having an unrecoverable database after a create failure.
> 
> This is essentially the same reason why CREATE DATABASE and related
> commands xlog directory copy operations only after completing them.
> That potentially wastes much more than a few blocks; but it's still
> non-dangerous, and far safer than the alternative.

Is this documented in a C comment somewhere?  Obviously not in a place
Robert found.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: RelationCreateStorage can orphan files

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> This is essentially the same reason why CREATE DATABASE and related
>> commands xlog directory copy operations only after completing them.
>> That potentially wastes much more than a few blocks; but it's still
>> non-dangerous, and far safer than the alternative.

> Is this documented in a C comment somewhere?  Obviously not in a place
> Robert found.

I had thought it was documented in the discussion of WAL logging rules
in access/transam/README, but it isn't.  I'll see about adding
something.
        regards, tom lane