Thread: Cleaning up unreferenced table files

Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
Here's a patch for the TODO item "Remove unreferenced table files created by transactions
that were in-progress when the server terminated abruptly."

It adds a new function, CleanupStaleRelFiles, that scans through the data
directory and removes all table files that are not mentioned in pg_class
of the corresponding database. CleanupStaleRelFiles is called after WAL
recovery.

Actually, the patch doesn't currently delete the files, just issues a
warning. Testing is easier if the files don't keep getting deleted :).

The patch also adds a GetTablespacePath function similar to
GetDatabasePath that constructs the path to a tablespace symbolic link.
commands/tablespace.c is modified to use it, in addition to the new
CleanupStaleRelFiles function.

- Heikki

Attachment

Re: Cleaning up unreferenced table files

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Here's a patch for the TODO item "Remove unreferenced table files created by transactions
> that were in-progress when the server terminated abruptly."

xlog.c is a fairly random place to put that functionality.  Didn't it
strike any warning bells for you when you had to add so many new
#includes?  I'm not entirely sure where this should go, but not there.

            regards, tom lane

Re: Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
On Sat, 5 Mar 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Here's a patch for the TODO item "Remove unreferenced table files created by transactions
>> that were in-progress when the server terminated abruptly."
>
> xlog.c is a fairly random place to put that functionality.  Didn't it
> strike any warning bells for you when you had to add so many new
> #includes?  I'm not entirely sure where this should go, but not there.

Yeah actually it did, but I forgot about it along the way. How about
putting it in a file of its own in backend/catalog? There's some code that
also deals with the data directories. Or straight into backend/storage.

- Heikki

Re: Cleaning up unreferenced table files

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Sat, 5 Mar 2005, Tom Lane wrote:
>> xlog.c is a fairly random place to put that functionality.  Didn't it
>> strike any warning bells for you when you had to add so many new
>> #includes?  I'm not entirely sure where this should go, but not there.

> Yeah actually it did, but I forgot about it along the way. How about
> putting it in a file of its own in backend/catalog? There's some code that
> also deals with the data directories. Or straight into backend/storage.

Actually, you could make some case for putting it in utils/init/ beside
flatfiles.c, which has got much the same sort of issues to deal with.

I think though that we ought to first consider the question of whether
we *want* this functionality.  On reflection I'm fairly nervous about
the idea of actually deleting anything during startup --- seems like a
good recipe for turning small failures into large failures.  But if
we're not going to delete anything then it's questionable whether we
need to code it like this at all.  It'd certainly be easier and safer to
examine these tables after the system is up and running normally.

            regards, tom lane

Re: Cleaning up unreferenced table files

From
Bruce Momjian
Date:
Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Sat, 5 Mar 2005, Tom Lane wrote:
> >> xlog.c is a fairly random place to put that functionality.  Didn't it
> >> strike any warning bells for you when you had to add so many new
> >> #includes?  I'm not entirely sure where this should go, but not there.
>
> > Yeah actually it did, but I forgot about it along the way. How about
> > putting it in a file of its own in backend/catalog? There's some code that
> > also deals with the data directories. Or straight into backend/storage.
>
> Actually, you could make some case for putting it in utils/init/ beside
> flatfiles.c, which has got much the same sort of issues to deal with.
>
> I think though that we ought to first consider the question of whether
> we *want* this functionality.  On reflection I'm fairly nervous about
> the idea of actually deleting anything during startup --- seems like a
> good recipe for turning small failures into large failures.  But if
> we're not going to delete anything then it's questionable whether we
> need to code it like this at all.  It'd certainly be easier and safer to
> examine these tables after the system is up and running normally.

Let's discuss this.  The patch as submitted checks for unreferenced
files on bootup and reports them in the log file, but does not delete
them.  That seems like the proper behavior.  I think we delete from
pgsql_tmp on bootup, but we _know_ those aren't referenced.

What other user interface would trigger this if we did it after startup?
Wouldn't we have to lock pg_class against VACUUM while we scan the file
system, and are we sure we do things in pg_class or the file system
first consistently?  It seems much more prone to error doing it while
the system is running.

I guess I am happy with just reporting during startup like the patch
does now.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
On Mon, 25 Apr 2005, Bruce Momjian wrote:

> Tom Lane wrote:
...
>> I think though that we ought to first consider the question of whether
>> we *want* this functionality.  On reflection I'm fairly nervous about
>> the idea of actually deleting anything during startup --- seems like a
>> good recipe for turning small failures into large failures.  But if
>> we're not going to delete anything then it's questionable whether we
>> need to code it like this at all.  It'd certainly be easier and safer to
>> examine these tables after the system is up and running normally.
>
> Let's discuss this.  The patch as submitted checks for unreferenced
> files on bootup and reports them in the log file, but does not delete
> them.  That seems like the proper behavior.  I think we delete from
> pgsql_tmp on bootup, but we _know_ those aren't referenced.
>
> What other user interface would trigger this if we did it after startup?
> Wouldn't we have to lock pg_class against VACUUM while we scan the file
> system, and are we sure we do things in pg_class or the file system
> first consistently?  It seems much more prone to error doing it while
> the system is running.

I agree.

Also, you can only have stale files after a backend crash, since they are
normally cleaned up at the end of transaction. If it was a separate
program or command, the administrator would have to be aware
of the issue. Otherwise, he wouldn't know he needs to run it after a
crash.

I feel that crashes that leaves behind stale files are rare. You
would need an application that creates/drops tables as part of normal
operation. Some kind of a large batch load might do that: BEGIN, CREATE
TABLE foo, COPY 1 GB of data, COMMIT.

The nasty thing right now is, you might end up with 1 GB of wasted disk
space, and never even know it.

> I guess I am happy with just reporting during startup like the patch
> does now.

Ok. I'll fix the design issues Tom addressed earlier, add documentation,
and resubmit.

We can come back to this after a release or two, when we have more
confidence in the feature. Maybe we'll also get some feedback on how often
those stale files occur in practice.

- Heikki

Re: Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
On Tue, 26 Apr 2005, Heikki Linnakangas wrote:

> On Mon, 25 Apr 2005, Bruce Momjian wrote:
...
>> I guess I am happy with just reporting during startup like the patch
>> does now.
>
> Ok. I'll fix the design issues Tom addressed earlier, add documentation, and
> resubmit.

Here you go.

The new functionality is now separated in a new file
backend/utils/init/cleanup.c.

There was code in many places that constructs the path to a tablespace
directory. I refactored that into a new function called GetTablespacePath
and put it next to GetDatabasePath in catalog.c.

I added a section under the "Routine Database Maintenance Tasks" that
basically gives a heads up that these notifications can appear in the log
after a crash.

- Heikki

Re: Cleaning up unreferenced table files

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> I feel that crashes that leaves behind stale files are rare.

Indeed, and getting more so all the time ... which makes me question
the value of doing anything about this at all.

            regards, tom lane

Re: Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
On Tue, 26 Apr 2005, Alvaro Herrera wrote:

> You forgot the attachment?

Damn. It happens time after time...

- Heikki

Attachment

Re: Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
On Tue, 26 Apr 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> I feel that crashes that leaves behind stale files are rare.
>
> Indeed, and getting more so all the time ...

How so? Have changes been made in those parts of the code?

> which makes me question the value of doing anything about this at all.

What bothers me is that we currently have no means of knowing if it
happens and how often it happens, so we're just guessing that "it's
rare". How rare? We don't know.

If nobody ever runs into this issue in production, and this whole exercise
turns out to be completely unnecessary, at least we'll know. That alone
makes me feel better.

The only drawback of the patch that I can see is the performance impact on
recovery. And I think the time it takes to scan the data directories isn't
much compared to WAL replay.

- Heikki

Re: Cleaning up unreferenced table files

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Tue, 26 Apr 2005, Tom Lane wrote:
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>> I feel that crashes that leaves behind stale files are rare.
>>
>> Indeed, and getting more so all the time ...

> How so? Have changes been made in those parts of the code?

No, just that the overall reliability of Postgres keeps improving.

At the time that TODO entry was created, I don't think we even had the
ability to roll back table creates/drops properly, so there were
scenarios in which unreferenced files could be left behind without even
assuming any software error.  And the prevalence of backend crashes was
way higher than it is now, too.  So I think a good argument could be
made that the TODO item isn't nearly as important as it was at the time.

> If nobody ever runs into this issue in production, and this whole exercise
> turns out to be completely unnecessary, at least we'll know. That alone
> makes me feel better.

We will know no such thing, unless the patch is made to announce the
problem so intrusively that people are certain to see it *and report it
to us*.  Which I don't think will be acceptable.

            regards, tom lane

Re: Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
On Wed, 27 Apr 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On Tue, 26 Apr 2005, Tom Lane wrote:
>>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>>> I feel that crashes that leaves behind stale files are rare.
>>>
>>> Indeed, and getting more so all the time ...
>
>> How so? Have changes been made in those parts of the code?
>
> No, just that the overall reliability of Postgres keeps improving.
>
> At the time that TODO entry was created, I don't think we even had the
> ability to roll back table creates/drops properly, so there were
> scenarios in which unreferenced files could be left behind without even
> assuming any software error.  And the prevalence of backend crashes was
> way higher than it is now, too.  So I think a good argument could be
> made that the TODO item isn't nearly as important as it was at the time.

*shrug*. I won't push any harder if you feel strongly against the patch.

Should we just remove the TODO item? And/or document the issue?

- Heikki

Re: Cleaning up unreferenced table files

From
Bruce Momjian
Date:
Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Tue, 26 Apr 2005, Tom Lane wrote:
> >> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> >>> I feel that crashes that leaves behind stale files are rare.
> >>
> >> Indeed, and getting more so all the time ...
>
> > How so? Have changes been made in those parts of the code?
>
> No, just that the overall reliability of Postgres keeps improving.
>
> At the time that TODO entry was created, I don't think we even had the
> ability to roll back table creates/drops properly, so there were
> scenarios in which unreferenced files could be left behind without even
> assuming any software error.  And the prevalence of backend crashes was
> way higher than it is now, too.  So I think a good argument could be
> made that the TODO item isn't nearly as important as it was at the time.
>
> > If nobody ever runs into this issue in production, and this whole exercise
> > turns out to be completely unnecessary, at least we'll know. That alone
> > makes me feel better.
>
> We will know no such thing, unless the patch is made to announce the
> problem so intrusively that people are certain to see it *and report it
> to us*.  Which I don't think will be acceptable.

Well, if putting it in the server logs isn't enough, I don't know what
is.

I think we do need the patch, at least to find out if there is an issue
we don't know about.  I don't see the hard in it.

Heikki, would you time startup and tell us what percentage of time is
taken by the routines?  Or I can do it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Cleaning up unreferenced table files

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I think we do need the patch, at least to find out if there is an issue
> we don't know about.

My point is that we won't find out anything, because we will have no
idea if people are noticing the log entries at all, much less telling
us about 'em.  Of course if they do tell us then we'd know something,
but what I am expecting is a vast silence.  We will not be able to tell
the difference between "no problem" and "very infrequent problem" any
better than we can now.

            regards, tom lane

Re: Cleaning up unreferenced table files

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I think we do need the patch, at least to find out if there is an issue
> > we don't know about.
>
> My point is that we won't find out anything, because we will have no
> idea if people are noticing the log entries at all, much less telling
> us about 'em.  Of course if they do tell us then we'd know something,
> but what I am expecting is a vast silence.  We will not be able to tell
> the difference between "no problem" and "very infrequent problem" any
> better than we can now.

I think people do look at their logs.  We are certainly better off
finding it than having no reporting at all.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Cleaning up unreferenced table files

From
Bruce Momjian
Date:
Uh, you forgot to add cleanup.h.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> On Tue, 26 Apr 2005, Alvaro Herrera wrote:
>
> > You forgot the attachment?
>
> Damn. It happens time after time...
>
> - Heikki

Content-Description:

[ Attachment, skipping... ]

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
How embarrasing...

I hope it's all there now.

On Thu, 28 Apr 2005, Bruce Momjian wrote:

> Uh, you forgot to add cleanup.h.
>
> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
>> On Tue, 26 Apr 2005, Alvaro Herrera wrote:
>>
>>> You forgot the attachment?
>>
>> Damn. It happens time after time...
>>
>> - Heikki
>
> Content-Description:
>
> [ Attachment, skipping... ]
>
> --
>  Bruce Momjian                        |  http://candle.pha.pa.us
>  pgman@candle.pha.pa.us               |  (610) 359-1001
>  +  If your life is a hard drive,     |  13 Roberts Road
>  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
>

- Heikki

Attachment

Re: Cleaning up unreferenced table files

From
Bruce Momjian
Date:
Applied.

---------------------------------------------------------------------------

pgman wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > > On Tue, 26 Apr 2005, Tom Lane wrote:
> > >> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > >>> I feel that crashes that leaves behind stale files are rare.
> > >>
> > >> Indeed, and getting more so all the time ...
> >
> > > How so? Have changes been made in those parts of the code?
> >
> > No, just that the overall reliability of Postgres keeps improving.
> >
> > At the time that TODO entry was created, I don't think we even had the
> > ability to roll back table creates/drops properly, so there were
> > scenarios in which unreferenced files could be left behind without even
> > assuming any software error.  And the prevalence of backend crashes was
> > way higher than it is now, too.  So I think a good argument could be
> > made that the TODO item isn't nearly as important as it was at the time.
>
> Agreed, but I don't think we are ever going to have the file system obey
> the same semantics as the database, so there will always be corner cases
> where we have the possibility for unreferenced files.
>
> > > If nobody ever runs into this issue in production, and this whole exercise
> > > turns out to be completely unnecessary, at least we'll know. That alone
> > > makes me feel better.
> >
> > We will know no such thing, unless the patch is made to announce the
> > problem so intrusively that people are certain to see it *and report it
> > to us*.  Which I don't think will be acceptable.
>
> You are right that checking only after a server crash isn't going to
> help us very much, because the odds someone is going to look at the logs
> just at that moment are slim, and if they stop and start the server
> again, the message will not appear so they will think it is fixed.
>
> Though the problem is caused by a server crash, it remains even after a
> clean startup.  My new version of the patch checks not just after a
> server crash, but every time the server starts up.  Here are some
> timings:
>
>     3dbs    without patch    0m0.230s  300 files
>     3dbs    with patch    0m0.230s  300 files
>     100dbs    with patch    0m0.270s  10000 files
>
> (The timing test started the server and waited for the message "database
> system is ready".)
>
> You can see that for three empty databases (only system tables/indexes),
> the patch has no impact.  For 100 databases, or 10,000 files, there is a
> 17% increase in startup time, which seems acceptable.
>
> The message printed in the logs is:
>
>     LOG:  The table or index file "/u/pgsql/data/base/1/100001" is stale and can be safely removed
>
> I made a few changes to the patch. I changed 'clean' to 'check' because
> I don't think we are ever going to remove the files on startup (If we
> did we could do the cleanup just on crash recovery startup.)  I changed
> the message from NOTICE to LOG, I made it run on all startups and not
> just crash recovery, and it was missing FreeDir() calls.
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

> Index: doc/src/sgml/maintenance.sgml
> ===================================================================
> RCS file: /cvsroot/pgsql/doc/src/sgml/maintenance.sgml,v
> retrieving revision 1.41
> diff -c -c -r1.41 maintenance.sgml
> *** doc/src/sgml/maintenance.sgml    20 Feb 2005 02:21:26 -0000    1.41
> --- doc/src/sgml/maintenance.sgml    1 May 2005 04:30:48 -0000
> ***************
> *** 474,479 ****
> --- 474,496 ----
>     </para>
>    </sect1>
>
> +  <sect1 id="check-files-after-crash">
> +   <title>Check files after crash</title>
> +
> +   <indexterm zone="check-files-after-crash">
> +    <primary>stale file</primary>
> +   </indexterm>
> +
> +   <para>
> +    <productname>PostgreSQL</productname> recovers automatically after crash
> +    using the write-ahead log (see <xref linkend="wal">) and no manual
> +    operations are normally needed. However, if there was a transaction running
> +    when the crash occured that created or dropped a relation, the
> +    transaction might have left a stale file in the data directory. If this
> +    happens, you will get a notice in the log file stating which files can be
> +    deleted.
> +   </para>
> +  </sect1>
>
>    <sect1 id="logfile-maintenance">
>     <title>Log File Maintenance</title>
> Index: src/backend/access/transam/xlog.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
> retrieving revision 1.189
> diff -c -c -r1.189 xlog.c
> *** src/backend/access/transam/xlog.c    28 Apr 2005 21:47:10 -0000    1.189
> --- src/backend/access/transam/xlog.c    1 May 2005 04:30:52 -0000
> ***************
> *** 43,48 ****
> --- 43,49 ----
>   #include "utils/builtins.h"
>   #include "utils/guc.h"
>   #include "utils/relcache.h"
> + #include "utils/flatfiles.h"
>
>
>   /*
> ***************
> *** 4525,4530 ****
> --- 4526,4533 ----
>
>           CreateCheckPoint(true, true);
>
> +         CheckStaleRelFiles();
> +
>           /*
>            * Close down recovery environment
>            */
> ***************
> *** 4536,4541 ****
> --- 4539,4550 ----
>            */
>           remove_backup_label();
>       }
> +     else
> +     {
> +         XLogInitRelationCache();
> +         CheckStaleRelFiles();
> +         XLogCloseRelationCache();
> +     }
>
>       /*
>        * Preallocate additional log files, if wanted.
> Index: src/backend/catalog/catalog.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/catalog/catalog.c,v
> retrieving revision 1.59
> diff -c -c -r1.59 catalog.c
> *** src/backend/catalog/catalog.c    14 Apr 2005 20:03:23 -0000    1.59
> --- src/backend/catalog/catalog.c    1 May 2005 04:30:52 -0000
> ***************
> *** 106,111 ****
> --- 106,144 ----
>       return path;
>   }
>
> + /*
> +  * GetTablespacePath    - construct path to a tablespace symbolic link
> +  *
> +  * Result is a palloc'd string.
> +  *
> +  * XXX this must agree with relpath and GetDatabasePath!
> +  */
> + char *
> + GetTablespacePath(Oid spcNode)
> + {
> +     int            pathlen;
> +     char       *path;
> +
> +     Assert(spcNode != GLOBALTABLESPACE_OID);
> +
> +     if (spcNode == DEFAULTTABLESPACE_OID)
> +     {
> +         /* The default tablespace is {datadir}/base */
> +         pathlen = strlen(DataDir) + 5 + 1;
> +         path = (char *) palloc(pathlen);
> +         snprintf(path, pathlen, "%s/base",
> +                  DataDir);
> +     }
> +     else
> +     {
> +         /* All other tablespaces have symlinks in pg_tblspc */
> +         pathlen = strlen(DataDir) + 11 + OIDCHARS + 1;
> +         path = (char *) palloc(pathlen);
> +         snprintf(path, pathlen, "%s/pg_tblspc/%u",
> +                  DataDir, spcNode);
> +     }
> +     return path;
> + }
>
>   /*
>    * IsSystemRelation
> Index: src/backend/commands/tablespace.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/commands/tablespace.c,v
> retrieving revision 1.17
> diff -c -c -r1.17 tablespace.c
> *** src/backend/commands/tablespace.c    14 Apr 2005 20:03:24 -0000    1.17
> --- src/backend/commands/tablespace.c    1 May 2005 04:30:53 -0000
> ***************
> *** 341,348 ****
>       /*
>        * All seems well, create the symlink
>        */
> !     linkloc = (char *) palloc(strlen(DataDir) + 11 + 10 + 1);
> !     sprintf(linkloc, "%s/pg_tblspc/%u", DataDir, tablespaceoid);
>
>       if (symlink(location, linkloc) < 0)
>           ereport(ERROR,
> --- 341,347 ----
>       /*
>        * All seems well, create the symlink
>        */
> !     linkloc = GetTablespacePath(tablespaceoid);
>
>       if (symlink(location, linkloc) < 0)
>           ereport(ERROR,
> ***************
> *** 495,502 ****
>       char       *subfile;
>       struct stat st;
>
> !     location = (char *) palloc(strlen(DataDir) + 11 + 10 + 1);
> !     sprintf(location, "%s/pg_tblspc/%u", DataDir, tablespaceoid);
>
>       /*
>        * Check if the tablespace still contains any files.  We try to rmdir
> --- 494,500 ----
>       char       *subfile;
>       struct stat st;
>
> !     location = GetTablespacePath(tablespaceoid);
>
>       /*
>        * Check if the tablespace still contains any files.  We try to rmdir
> ***************
> *** 1036,1043 ****
>           set_short_version(location);
>
>           /* Create the symlink if not already present */
> !         linkloc = (char *) palloc(strlen(DataDir) + 11 + 10 + 1);
> !         sprintf(linkloc, "%s/pg_tblspc/%u", DataDir, xlrec->ts_id);
>
>           if (symlink(location, linkloc) < 0)
>           {
> --- 1034,1040 ----
>           set_short_version(location);
>
>           /* Create the symlink if not already present */
> !         linkloc = GetTablespacePath(xlrec->ts_id);
>
>           if (symlink(location, linkloc) < 0)
>           {
> Index: src/backend/utils/adt/misc.c
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/utils/adt/misc.c,v
> retrieving revision 1.40
> diff -c -c -r1.40 misc.c
> *** src/backend/utils/adt/misc.c    31 Dec 2004 22:01:22 -0000    1.40
> --- src/backend/utils/adt/misc.c    1 May 2005 04:30:53 -0000
> ***************
> *** 26,31 ****
> --- 26,32 ----
>   #include "funcapi.h"
>   #include "catalog/pg_type.h"
>   #include "catalog/pg_tablespace.h"
> + #include "catalog/catalog.h"
>
>   #define atooid(x)  ((Oid) strtoul((x), NULL, 10))
>
> ***************
> *** 144,154 ****
>
>           fctx = palloc(sizeof(ts_db_fctx));
>
> -         /*
> -          * size = path length + tablespace dirname length + 2 dir sep
> -          * chars + oid + terminator
> -          */
> -         fctx->location = (char *) palloc(strlen(DataDir) + 11 + 10 + 1);
>           if (tablespaceOid == GLOBALTABLESPACE_OID)
>           {
>               fctx->dirdesc = NULL;
> --- 145,150 ----
> ***************
> *** 157,168 ****
>           }
>           else
>           {
> !             if (tablespaceOid == DEFAULTTABLESPACE_OID)
> !                 sprintf(fctx->location, "%s/base", DataDir);
> !             else
> !                 sprintf(fctx->location, "%s/pg_tblspc/%u", DataDir,
> !                         tablespaceOid);
> !
>               fctx->dirdesc = AllocateDir(fctx->location);
>
>               if (!fctx->dirdesc)
> --- 153,159 ----
>           }
>           else
>           {
> !             fctx->location = GetTablespacePath(tablespaceOid);
>               fctx->dirdesc = AllocateDir(fctx->location);
>
>               if (!fctx->dirdesc)
> Index: src/backend/utils/init/Makefile
> ===================================================================
> RCS file: /cvsroot/pgsql/src/backend/utils/init/Makefile,v
> retrieving revision 1.18
> diff -c -c -r1.18 Makefile
> *** src/backend/utils/init/Makefile    20 Feb 2005 02:22:00 -0000    1.18
> --- src/backend/utils/init/Makefile    1 May 2005 04:30:53 -0000
> ***************
> *** 12,18 ****
>   top_builddir = ../../../..
>   include $(top_builddir)/src/Makefile.global
>
> ! OBJS = flatfiles.o globals.o miscinit.o postinit.o
>
>   all: SUBSYS.o
>
> --- 12,18 ----
>   top_builddir = ../../../..
>   include $(top_builddir)/src/Makefile.global
>
> ! OBJS = flatfiles.o globals.o miscinit.o postinit.o checkfiles.o
>
>   all: SUBSYS.o
>
> Index: src/include/catalog/catalog.h
> ===================================================================
> RCS file: /cvsroot/pgsql/src/include/catalog/catalog.h,v
> retrieving revision 1.30
> diff -c -c -r1.30 catalog.h
> *** src/include/catalog/catalog.h    31 Dec 2004 22:03:24 -0000    1.30
> --- src/include/catalog/catalog.h    1 May 2005 04:30:54 -0000
> ***************
> *** 19,24 ****
> --- 19,25 ----
>
>   extern char *relpath(RelFileNode rnode);
>   extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
> + extern char *GetTablespacePath(Oid spcNode);
>
>   extern bool IsSystemRelation(Relation relation);
>   extern bool IsToastRelation(Relation relation);
> Index: src/include/utils/flatfiles.h
> ===================================================================
> RCS file: /cvsroot/pgsql/src/include/utils/flatfiles.h,v
> retrieving revision 1.1
> diff -c -c -r1.1 flatfiles.h
> *** src/include/utils/flatfiles.h    20 Feb 2005 02:22:07 -0000    1.1
> --- src/include/utils/flatfiles.h    1 May 2005 04:30:54 -0000
> ***************
> *** 30,33 ****
> --- 30,36 ----
>
>   extern Datum flatfile_update_trigger(PG_FUNCTION_ARGS);
>
> + /* from checkfiles.c */
> + extern void CheckStaleRelFiles(void);
> +
>   #endif   /* FLATFILES_H */

> /*-------------------------------------------------------------------------
>  *
>  * checkfiles.c
>  *      support to clean up stale relation files on crash recovery
>  *
>  * If a backend crashes while in a transaction that has created or
>  * deleted a relfilenode, a stale file can be left over in the data
>  * directory. This file contains routines to clean up those stale
>  * files on recovery.
>  *
>  * $PostgreSQL$
>  *
>  *-------------------------------------------------------------------------
>  */
> #include "postgres.h"
>
> #include "storage/fd.h"
>
> #include "utils/flatfiles.h"
> #include "miscadmin.h"
> #include "catalog/pg_tablespace.h"
> #include "catalog/catalog.h"
> #include "access/skey.h"
> #include "utils/fmgroids.h"
> #include "access/relscan.h"
> #include "access/heapam.h"
> #include "utils/resowner.h"
>
> static void CheckStaleRelFilesFrom(Oid tablespaceoid, Oid dboid);
> static void CheckStaleRelFilesFromTablespace(Oid tablespaceoid);
>
> /* Like AllocateDir, but ereports on failure */
> static DIR *
> AllocateDirChecked(char *path)
> {
>     DIR *dirdesc = AllocateDir(path);
>     if (dirdesc == NULL)
>         ereport(ERROR,
>                 (errcode_for_file_access(),
>                  errmsg("could not open directory \"%s\": %m",
>                         path)));
>     return dirdesc;
> }
>
> /*
>  * Scan through all tablespaces for relations left over
>  * by aborted transactions.
>  *
>  * For example, if a transaction issues
>  * BEGIN; CREATE TABLE foobar ();
>  * and then the backend crashes, the file is left in the
>  * tablespace until CheckStaleRelFiles deletes it.
>  */
> void
> CheckStaleRelFiles(void)
> {
>     DIR                *dirdesc;
>     struct dirent *de;
>     char       *path;
>     int pathlen;
>
>     pathlen = strlen(DataDir) + 11 + 1;
>     path = (char *) palloc(pathlen);
>     snprintf(path, pathlen, "%s/pg_tblspc/", DataDir);
>     dirdesc = AllocateDirChecked(path);
>     while ((de = readdir(dirdesc)) != NULL)
>     {
>         char *invalid;
>         Oid tablespaceoid;
>
>         /* Check that the directory name looks like valid tablespace link.    */
>         tablespaceoid = (Oid) strtol(de->d_name, &invalid, 10);
>         if(invalid[0] == '\0')
>             CheckStaleRelFilesFromTablespace(tablespaceoid);
>     }
>     FreeDir(dirdesc);
>     pfree(path);
>
>     CheckStaleRelFilesFromTablespace(DEFAULTTABLESPACE_OID);
> }
>
> /* Scan a specific tablespace for stale relations */
> static void
> CheckStaleRelFilesFromTablespace(Oid tablespaceoid)
> {
>     DIR                *dirdesc;
>     struct dirent *de;
>     char       *path;
>
>     path = GetTablespacePath(tablespaceoid);
>
>     dirdesc = AllocateDirChecked(path);
>     while ((de = readdir(dirdesc)) != NULL)
>     {
>         char *invalid;
>         Oid dboid;
>
>         dboid = (Oid) strtol(de->d_name, &invalid, 10);
>         if(invalid[0] == '\0')
>             CheckStaleRelFilesFrom(tablespaceoid, dboid);
>     }
>     FreeDir(dirdesc);
>     pfree(path);
> }
>
> /* Scan a specific database in a specific tablespace for stale relations.
>  *
>  * First, pg_class for the database is opened, and the relfilenodes of all
>  * relations mentioned there are stored in a hash table.
>  *
>  * Then the directory is scanned. Every file in the directory that's not
>  * found in pg_class (the hash table) is logged.
>  */
> static void
> CheckStaleRelFilesFrom(Oid tablespaceoid, Oid dboid)
> {
>     DIR *dirdesc;
>     struct dirent *de;
>     HASHCTL hashctl;
>     HTAB *relfilenodeHash;
>     MemoryContext mcxt;
>     RelFileNode rnode;
>     char *path;
>
>     /* We create a private memory context so that we can easily deallocate
>      * the hash table and its contents
>      */
>     mcxt = AllocSetContextCreate(TopMemoryContext, "CheckStaleRelFiles",
>                                  ALLOCSET_DEFAULT_MINSIZE,
>                                  ALLOCSET_DEFAULT_INITSIZE,
>                                  ALLOCSET_DEFAULT_MAXSIZE);
>
>     hashctl.hash = tag_hash;
>     /* The entry contents is not used for anything, we just check
>      * if an oid is in the hash table or not. */
>     hashctl.keysize = sizeof(Oid);
>     hashctl.entrysize = 1;
>     hashctl.hcxt = mcxt;
>     relfilenodeHash = hash_create("relfilenodeHash", 100, &hashctl,
>                                   HASH_FUNCTION
>                                   | HASH_ELEM | HASH_CONTEXT);
>
>     /* Read all relfilenodes from pg_class into the hash table */
>     {
>         ResourceOwner owner, oldowner;
>         Relation        rel;
>         HeapScanDesc scan;
>         HeapTuple tuple;
>
>         /* Need a resowner to keep the heapam and buffer code happy */
>         owner = ResourceOwnerCreate(NULL, "CheckStaleRelFiles");
>         oldowner = CurrentResourceOwner;
>         CurrentResourceOwner = owner;
>
>         rnode.spcNode = tablespaceoid;
>         rnode.dbNode = dboid;
>         rnode.relNode = RelationRelationId;
>         rel = XLogOpenRelation(true, 0 , rnode);
>
>         scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
>         while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
>         {
>             Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple);
>
>             hash_search(relfilenodeHash, &classform->relfilenode,
>                         HASH_ENTER, NULL);
>         }
>         heap_endscan(scan);
>
>         XLogCloseRelation(rnode);
>         CurrentResourceOwner = oldowner;
>         ResourceOwnerDelete(owner);
>     }
>
>     /* Scan the directory */
>     path = GetDatabasePath(dboid, tablespaceoid);
>
>     dirdesc = AllocateDirChecked(path);
>     while ((de = readdir(dirdesc)) != NULL)
>     {
>         char *invalid;
>         Oid relfilenode;
>
>         relfilenode = strtol(de->d_name, &invalid, 10);
>         if(invalid[0] == '\0')
>         {
>             /* Filename was a valid number, check if pg_class knows
>                about it */
>             if(hash_search(relfilenodeHash, &relfilenode,
>                            HASH_FIND, NULL) == NULL)
>             {
>                 char *filepath;
>                 rnode.spcNode = tablespaceoid;
>                 rnode.dbNode = dboid;
>                 rnode.relNode = relfilenode;
>
>                 filepath = relpath(rnode);
>
>                 ereport(LOG,
>                         (errcode_for_file_access(),
>                          errmsg("The table or index file \"%s\" is stale and can be safely removed",
>                                 filepath)));
>                 pfree(filepath);
>             }
>         }
>     }
>     FreeDir(dirdesc);
>     pfree(path);
>     hash_destroy(relfilenodeHash);
>     MemoryContextDelete(mcxt);
> }

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Cleaning up unreferenced table files

From
Simon Riggs
Date:
Heikki,

Good patch.

...The patch makes no mention of temporary files, which are left there
after a crash.

Temp files are only removed on a normal startup, whereas the patch
invokes removal on both normal startup and crash recovery. That doesn't
make much sense...

Also, temp file deletion happens in the postmaster, not at the end of
recovery in xlog.c.

Could we rationalise those two? One place, one action for both?

I'd rather have this also as an administrator command rather than as
something automatically run after crash recovery. That way I have my
debugging opportunity, but the admin can remove them without restarting
the server.

Same code, just a Function instead...

Best Regards, Simon Riggs

reference from fd.c: (this is not a patch)

/*
 * Remove temporary files left over from a prior postmaster session
 *
 * This should be called during postmaster startup.  It will forcibly
 * remove any leftover files created by OpenTemporaryFile.
 *
 * NOTE: we could, but don't, call this during a post-backend-crash
restart
 * cycle.  The argument for not doing it is that someone might want to
examine
 * the temp files for debugging purposes.  This does however mean that
 * OpenTemporaryFile had better allow for collision with an existing
temp
 * file name.
 */
void
RemovePgTempFiles(void)



Re: Cleaning up unreferenced table files

From
Bruce Momjian
Date:
FYI, his patch is in CVS and now only _reports_ unreferenced files, and
it happens on recovery and normal startup.

---------------------------------------------------------------------------

Simon Riggs wrote:
> Heikki,
>
> Good patch.
>
> ...The patch makes no mention of temporary files, which are left there
> after a crash.
>
> Temp files are only removed on a normal startup, whereas the patch
> invokes removal on both normal startup and crash recovery. That doesn't
> make much sense...
>
> Also, temp file deletion happens in the postmaster, not at the end of
> recovery in xlog.c.
>
> Could we rationalise those two? One place, one action for both?
>
> I'd rather have this also as an administrator command rather than as
> something automatically run after crash recovery. That way I have my
> debugging opportunity, but the admin can remove them without restarting
> the server.
>
> Same code, just a Function instead...
>
> Best Regards, Simon Riggs
>
> reference from fd.c: (this is not a patch)
>
> /*
>  * Remove temporary files left over from a prior postmaster session
>  *
>  * This should be called during postmaster startup.  It will forcibly
>  * remove any leftover files created by OpenTemporaryFile.
>  *
>  * NOTE: we could, but don't, call this during a post-backend-crash
> restart
>  * cycle.  The argument for not doing it is that someone might want to
> examine
>  * the temp files for debugging purposes.  This does however mean that
>  * OpenTemporaryFile had better allow for collision with an existing
> temp
>  * file name.
>  */
> void
> RemovePgTempFiles(void)
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Cleaning up unreferenced table files

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Applied.

Now that I've had a chance to look at it, this patch is thoroughly
broken.  Problems observed in a quick review:

1. It doesn't work at all for non-default tablespaces: it will
claim that every file in such a tablespace is stale.  The fact
that it does that rather than failing entirely is accidental.
It tries to read the database's pg_class in the target tablespace
whether it's there or not.  Because the system is still in recovery
mode, the low-level routines allow the access to the nonexistent
pg_class table to pass --- in fact they think they should create
the file, so after it runs there's a bogus empty "1259" file in each
such tablespace (which of course it complains about, too).  The code
then proceeds to think that pg_class is empty so of course everything
draws a warning.

2. It's not robust against stale subdirectories of a tablespace
(ie, subdirs corresponding to a nonexistent database) --- again,
it'll try to read a nonexistent pg_class.  Then it'll produce a
bunch of off-target complaint messages.

3. It's assuming that relfilenode is unique database-wide, when no
such assumption is safe.  We only have a guarantee that it's unique
tablespace-wide.

4. It fails to examine table segment files (such as "nnn.1").  These
should be complained of when the "nnn" doesn't match any hash entry.

5. It will load every relfilenode value in pg_class into the hashtable
whether it's meaningful or not.  There should be a check on relkind.

6. I don't think relying on strtol to decide if a filename is entirely
numeric is very safe.  Note all the extra defenses in pg_atoi against
various platform-specific misbehaviors of strtol.  Personally I'd use a
strspn test instead.

7. There are no checks for readdir failure (compare any other readdir
loop in the backend).

See also Simon Riggs' complaints that the circumstances under which it's
done are pretty randomly selected.  (One particular thing that I think
is a bad idea is to do this in a standalone backend.  Any sort of
corruption in any db's pg_class would render it impossible to start up.)

To fix the first three problems, and also avoid the performance problem
of multiply rescanning a database's pg_class for each of its
tablespaces, I would suggest that the hashtable entries be widened to
RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid).  Then
there should be one iteration over pg_database to learn the OIDs and
default tablespaces of each database; with that you can read pg_class
from its correct location for each database and load all the entries
into the hashtable.  Then you iterate through the tablespaces looking
for stuff not present in the hashtable.  You might also want to build a
list or hashtable of known database OIDs, so that you can recognize a
stale subdirectory immediately and issue a direct complaint about it
without even recursing into it.

            regards, tom lane

Re: Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
On Thu, 5 May 2005, Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Applied.
>
> Now that I've had a chance to look at it, this patch is thoroughly
> broken.  Problems observed in a quick review:
>
> 1. It doesn't work at all for non-default tablespaces: it will
> claim that every file in such a tablespace is stale.  The fact
> that it does that rather than failing entirely is accidental.
> It tries to read the database's pg_class in the target tablespace
> whether it's there or not.  Because the system is still in recovery
> mode, the low-level routines allow the access to the nonexistent
> pg_class table to pass --- in fact they think they should create
> the file, so after it runs there's a bogus empty "1259" file in each
> such tablespace (which of course it complains about, too).  The code
> then proceeds to think that pg_class is empty so of course everything
> draws a warning.
>
> 2. It's not robust against stale subdirectories of a tablespace
> (ie, subdirs corresponding to a nonexistent database) --- again,
> it'll try to read a nonexistent pg_class.  Then it'll produce a
> bunch of off-target complaint messages.
>
> 3. It's assuming that relfilenode is unique database-wide, when no
> such assumption is safe.  We only have a guarantee that it's unique
> tablespace-wide.
>
> 4. It fails to examine table segment files (such as "nnn.1").  These
> should be complained of when the "nnn" doesn't match any hash entry.
>
> 5. It will load every relfilenode value in pg_class into the hashtable
> whether it's meaningful or not.  There should be a check on relkind.
>
> 6. I don't think relying on strtol to decide if a filename is entirely
> numeric is very safe.  Note all the extra defenses in pg_atoi against
> various platform-specific misbehaviors of strtol.  Personally I'd use a
> strspn test instead.
>

I'll fix 1-6 according to your suggestions, and send another patch.

It shows how little experience I have with multiple database
and tablespace management.

> 7. There are no checks for readdir failure (compare any other readdir
> loop in the backend).

I couldn't figure out what you meant. The readdir code is the same as
anywhere else. Also, man page (Linux) says that readdir returns NULL on
error, and that is checked.

> See also Simon Riggs' complaints that the circumstances under which it's
> done are pretty randomly selected.  (One particular thing that I think
> is a bad idea is to do this in a standalone backend.  Any sort of
> corruption in any db's pg_class would render it impossible to start up.)

I'd agree with Simons complaints if we actually deleted the files. But
since we only report them, it's a good idea to report them on every
startup, otherwise the DBA might think that the stale files are not there
anymore since the system isn't complaining about them anymore.

The original patch only ran the check on crash recovery, but Bruce changed
it to run on startup as well, for the above reason.

I agree, though, that it's a bad idea to do it in standalone mode. I'll
add a check for that. Also it probably shouldn't stop the startup even if
some pg_class is corrupt. Other databases could be fine.

> To fix the first three problems, and also avoid the performance problem
> of multiply rescanning a database's pg_class for each of its
> tablespaces, I would suggest that the hashtable entries be widened to
> RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid).  Then
> there should be one iteration over pg_database to learn the OIDs and
> default tablespaces of each database; with that you can read pg_class
> from its correct location for each database and load all the entries
> into the hashtable.  Then you iterate through the tablespaces looking
> for stuff not present in the hashtable.  You might also want to build a
> list or hashtable of known database OIDs, so that you can recognize a
> stale subdirectory immediately and issue a direct complaint about it
> without even recursing into it.
>
>             regards, tom lane
>

- Heikki

Re: Cleaning up unreferenced table files

From
Heikki Linnakangas
Date:
Maybe we should take a different approach to the problem:

1. Create new file with an extension to mark that it's not
    yet committed (eg. 1234.notcommitted)
2. ...
3. Take CheckpointStartLock
4. Write commit record to WAL, with list of created files.
5. rename created file (1234.notcommitted -> 1234).
6. Release CheckpointStartLock

This would guarantee that after successful WAL replay, all files in the
data directory with .notcommitted extension can be safely deleted. No need
to read pg_database or pg_class.

We would take a performance hit because of the additional rename and fsync
step. Also, we must somehow make sure that the new file or the directory
it's in is fsynced on checkpoint to make sure that the rename is flushed
to disk.

A variant of the scheme would be to create two files on step 1. One would
be the actual relfile (1234) and the other would an empty marker file
(1234.notcommitted). That way the smgr code wouldn't have to care it the
file is new or not when opening it.

- Heikki

On Thu, 5 May 2005, Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Applied.
>
> Now that I've had a chance to look at it, this patch is thoroughly
> broken.  Problems observed in a quick review:
>
> 1. It doesn't work at all for non-default tablespaces: it will
> claim that every file in such a tablespace is stale.  The fact
> that it does that rather than failing entirely is accidental.
> It tries to read the database's pg_class in the target tablespace
> whether it's there or not.  Because the system is still in recovery
> mode, the low-level routines allow the access to the nonexistent
> pg_class table to pass --- in fact they think they should create
> the file, so after it runs there's a bogus empty "1259" file in each
> such tablespace (which of course it complains about, too).  The code
> then proceeds to think that pg_class is empty so of course everything
> draws a warning.
>
> 2. It's not robust against stale subdirectories of a tablespace
> (ie, subdirs corresponding to a nonexistent database) --- again,
> it'll try to read a nonexistent pg_class.  Then it'll produce a
> bunch of off-target complaint messages.
>
> 3. It's assuming that relfilenode is unique database-wide, when no
> such assumption is safe.  We only have a guarantee that it's unique
> tablespace-wide.
>
> 4. It fails to examine table segment files (such as "nnn.1").  These
> should be complained of when the "nnn" doesn't match any hash entry.
>
> 5. It will load every relfilenode value in pg_class into the hashtable
> whether it's meaningful or not.  There should be a check on relkind.
>
> 6. I don't think relying on strtol to decide if a filename is entirely
> numeric is very safe.  Note all the extra defenses in pg_atoi against
> various platform-specific misbehaviors of strtol.  Personally I'd use a
> strspn test instead.
>
> 7. There are no checks for readdir failure (compare any other readdir
> loop in the backend).
>
> See also Simon Riggs' complaints that the circumstances under which it's
> done are pretty randomly selected.  (One particular thing that I think
> is a bad idea is to do this in a standalone backend.  Any sort of
> corruption in any db's pg_class would render it impossible to start up.)
>
> To fix the first three problems, and also avoid the performance problem
> of multiply rescanning a database's pg_class for each of its
> tablespaces, I would suggest that the hashtable entries be widened to
> RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid).  Then
> there should be one iteration over pg_database to learn the OIDs and
> default tablespaces of each database; with that you can read pg_class
> from its correct location for each database and load all the entries
> into the hashtable.  Then you iterate through the tablespaces looking
> for stuff not present in the hashtable.  You might also want to build a
> list or hashtable of known database OIDs, so that you can recognize a
> stale subdirectory immediately and issue a direct complaint about it
> without even recursing into it.
>
>             regards, tom lane
>

- Heikki

Re: Cleaning up unreferenced table files

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Maybe we should take a different approach to the problem:
> 1. Create new file with an extension to mark that it's not
>     yet committed (eg. 1234.notcommitted)

This is pushing the problem into the wrong place, viz the lowest-level
file access routines, which will now all have to know about
.notcommitted status.  It also creates race conditions --- think about
backend A trying to commit file 1234 at about the same time that
backend B is trying to flush some dirty buffers belonging to that file.
But most importantly, it doesn't handle the file-deletion case.

            regards, tom lane